I'm attempting to use:
Plain Text
llm = HuggingFaceLLM(
...
tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
model_name="meta-llama/Llama-2-7b-chat-hf",
...)


But I keep getting an error: "ValueError: Need either a state_dict or a save_folder containing offloaded weights." I've tried specifying an empty save_folder right in the HuggingFaceLLM() call, but that's an unexpected keyword argument, and I've also tried adding it to generate_kwargs={} and tokenizer_kwargs={} without success. I suspect it's not just looking for a blank folder, either. Any ideas?
13 comments
If you are getting that error, it's not even worth figuring out 😅

That means you don't have enough memory to fit the model, so it will keep offloading weights to disk and loading them back as the model runs.

This process is extremely slow. Like, hours for a response 🙃
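If you really want to make the disk-offload path work, the folder usually has to go in through the model kwargs rather than HuggingFaceLLM itself. A rough sketch, assuming model_kwargs is forwarded to from_pretrained (offload_folder is accelerate's option, not a HuggingFaceLLM parameter):

Plain Text
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    # assumption: these kwargs are passed through to AutoModelForCausalLM.from_pretrained
    model_kwargs={"device_map": "auto", "offload_folder": "./offload"},
)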
Oh, this must be a function of using the HuggingFaceLLM() interface? I'm using the Llama-2-13B-chat model with very good speeds via LM Studio on the same machine.
I suppose it is different actually
q4_k_s GGML
So that's important
Yea! If you want to run on GGML stuff, I would check out llama.cpp

https://python.langchain.com/docs/integrations/llms/llamacpp
We are working on our own integration, but for now the langchain version also works

Plain Text
from langchain.llms import LlamaCpp
from llama_index import ServiceContext
from llama_index.llms import LangChainLLM

llm = LangChainLLM(LlamaCpp(...))
service_context = ServiceContext.from_defaults(llm=llm)
Although it sounds like you have enough resources to run using huggingface as well 🙂
OK, I will check that out. I'm trying to go the mainstream routes (with the exception of using a local model) since I am clearly bumbling through this a bit.
It doesn't seem like I have the resources for HF, though, if I'm getting this error? Unless there are quantized versions of the HF models...
There are quantized versions out there I think 🤔 But llama.cpp compiled with GPU support also worked ok in my testing, just a bit more setup I guess
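For reference, wiring a local GGML file into the LangChain LlamaCpp wrapper with some layers on the GPU looks roughly like this (the model path and layer count are placeholders, and n_gpu_layers only does anything if llama-cpp-python was built with GPU support):

Plain Text
from langchain.llms import LlamaCpp
from llama_index.llms import LangChainLLM

llm = LangChainLLM(
    LlamaCpp(
        model_path="./models/llama-2-13b-chat.ggmlv3.q4_K_S.bin",  # placeholder path to your GGML file
        n_gpu_layers=40,  # how many layers to push onto the GPU
        n_ctx=3900,       # Llama 2 context window
        temperature=0.1,
    )
)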
OK, I have it producing responses via the LangChainLLM(LlamaCpp()) interface, though it seems to be largely nonsense. I will have to figure out how to do the proper prompt crafting across these interfaces
Llama 2 requires some pretty specific structure. For example, we have these util functions here
https://github.com/jerryjliu/llama_index/blob/main/llama_index/llms/llama_utils.py

You may have to wrap llama cpp in a custom llm layer so that you have control over how stuff gets formatted before prediction
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced
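Very roughly, it ends up looking something like this. This is only a sketch following the CustomLLM pattern from that docs page plus the llama_utils formatter; the class name and model path are made up, and the exact imports depend on your llama_index version:

Plain Text
from langchain.llms import LlamaCpp
from llama_index.llms import CustomLLM, CompletionResponse, CompletionResponseGen, LLMMetadata
from llama_index.llms.base import llm_completion_callback
from llama_index.llms.llama_utils import completion_to_prompt

# plain LangChain LlamaCpp underneath; the path is a placeholder
llama = LlamaCpp(model_path="./models/llama-2-13b-chat.ggmlv3.q4_K_S.bin", n_ctx=3900, max_tokens=256)

class LlamaCppCustomLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(context_window=3900, num_output=256)

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        # wrap the raw prompt in Llama 2's [INST] / <<SYS>> structure before predicting
        formatted = completion_to_prompt(prompt)
        return CompletionResponse(text=llama(formatted))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        raise NotImplementedError

Then you can pass LlamaCppCustomLLM() into ServiceContext.from_defaults(llm=...) the same way as before.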
So many layers haha