I'm attempting to use:
Plain Text
llm = HuggingFaceLLM(
...
tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
model_name="meta-llama/Llama-2-7b-chat-hf",
...)


But I keep getting an error: "ValueError: Need either a state_dict or a save_folder containing offloaded weights." I've tried specifying an empty save_folder right in the HuggingFaceLLM() call, but that's an unexpected keyword argument, and I've also tried adding it to generate_kwargs={} and tokenizer_kwargs={} without success. I suspect it's not just looking for a blank folder, either. Any ideas?
13 comments
If you are getting that error, it's not even worth figuring out 😅

That means you don't have enough memory to fit the model, so it will keep offloading weights to disk and loading them back as the model runs.

This process is extremely slow. Like, hours for a response 🙃
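If you really want to make the disk-offload path work, the folder usually has to go in through the model kwargs rather than HuggingFaceLLM itself. A rough sketch, assuming model_kwargs is forwarded to from_pretrained (offload_folder is accelerate's option, not a HuggingFaceLLM parameter):

Plain Text
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    # assumption: these kwargs are passed through to AutoModelForCausalLM.from_pretrained
    model_kwargs={"device_map": "auto", "offload_folder": "./offload"},
)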
Oh, this must be a function of using the HuggingFaceLLM() interface? I'm using the Llama-2-13B-chat model with very good speeds via LM Studio on the same machine.
I suppose it is different actually
q4_k_s GGML
So that's important
Yea! If you want to run on GGML stuff, I would check out llama.cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
We are working on our own integration, but for now the langchain version also works

Plain Text
from langchain.llms import LlamaCpp
from llama_index import ServiceContext
from llama_index.llms import LangChainLLM

llm = LangChainLLM(LlamaCpp(...))
service_context = ServiceContext.from_defaults(llm=llm)
Although it sounds like you have enough resources to run using huggingface as well 🙂
OK, I will check that out. I'm trying to go the mainstream routes (with the exception of using a local model) since I am clearly bumbling through this a bit.
It doesn't seem like I have the resources for HF, though, if I'm getting this error? Unless there are quantized versions of the HF models...
There are quantized versions out there I think 🤔 But llama.cpp compiled with GPU support also worked ok in my testing, just a bit more setup I guess
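For reference, wiring a local GGML file into the LangChain LlamaCpp wrapper with some layers on the GPU looks roughly like this (the model path and layer count are placeholders, and n_gpu_layers only does anything if llama-cpp-python was built with GPU support):

Plain Text
from langchain.llms import LlamaCpp
from llama_index.llms import LangChainLLM

llm = LangChainLLM(
    LlamaCpp(
        model_path="./models/llama-2-13b-chat.ggmlv3.q4_K_S.bin",  # placeholder path to your GGML file
        n_gpu_layers=40,  # how many layers to push onto the GPU
        n_ctx=3900,       # Llama 2 context window
        temperature=0.1,
    )
)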
OK, I have it producing responses via the LangChainLLM(LlamaCpp()) interface, though it seems to be largely nonsense. I will have to figure out how to do the proper prompt crafting across these interfaces
Llama 2 requires some pretty specific structure. For example, we have these util functions here
https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/llama_utils.py

You may have to wrap llama cpp in a custom llm layer so that you have control over how stuff gets formatted before prediction
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced
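Very roughly, it ends up looking something like this. This is only a sketch following the CustomLLM pattern from that docs page plus the llama_utils formatter; the class name and model path are made up, and the exact imports depend on your llama_index version:

Plain Text
from langchain.llms import LlamaCpp
from llama_index.llms import CustomLLM, CompletionResponse, CompletionResponseGen, LLMMetadata
from llama_index.llms.base import llm_completion_callback
from llama_index.llms.llama_utils import completion_to_prompt

# plain LangChain LlamaCpp underneath; the path is a placeholder
llama = LlamaCpp(model_path="./models/llama-2-13b-chat.ggmlv3.q4_K_S.bin", n_ctx=3900, max_tokens=256)

class LlamaCppCustomLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(context_window=3900, num_output=256)

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        # wrap the raw prompt in Llama 2's [INST] / <<SYS>> structure before predicting
        formatted = completion_to_prompt(prompt)
        return CompletionResponse(text=llama(formatted))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        raise NotImplementedError

Then you can pass LlamaCppCustomLLM() into ServiceContext.from_defaults(llm=...) the same way as before.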
So many layers haha