Hi guys, I am trying to use the new Command R 4-bit model with LlamaIndex. My machine runs the model just fine with plain transformers code from HF, but when I wrap it in LlamaIndex I get an OOM error.
This is my LlamaIndex code:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
from llama_index.core import PromptTemplate
import torch

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{query_str}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")

llm = HuggingFaceLLM(
    context_window=16384,
    max_new_tokens=4096,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
    # tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)

Settings.llm = llm
Settings.chunk_size = 1024
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context in 3 paragraphs? Return your response which covers the key points of the text and does not miss anything important, please."))

The error message:
Plain Text
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to
`from_pretrained`.
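(For reference, a rough sketch of what that error is asking for, wired through HuggingFaceLLM. This assumes a recent transformers where the flag is exposed as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig, and that whether the override is accepted for an already-quantized checkpoint depends on your transformers version; the max_memory budgets below are placeholders you'd adjust to your hardware, and anything offloaded to CPU will run very slowly.)
Plain Text
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Let the non-quantized (fp32) modules live on the CPU when the GPU runs out of room.
offload_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

llm = HuggingFaceLLM(
    context_window=16384,
    max_new_tokens=4096,
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
    # other arguments as in the original snippet
    model_kwargs={
        "quantization_config": offload_config,
        # illustrative budgets only -- set these to a bit below your actual VRAM / RAM
        "max_memory": {0: "20GiB", "cpu": "48GiB"},
    },
)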
12 comments
When you use it outside of llama-index, I'm guessing you are just giving it small inputs?

LlamaIndex will give the model large inputs. With local PyTorch models like this, memory is allocated lazily: when the model sees an input bigger than anything it has seen before, it allocates new memory to handle it. This keeps happening until it sees the largest possible input (16384 tokens in this case).
I suspect you'd also get OOM in pure huggingface if you prompted the model with a large input
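A quick way to check that claim, as a sketch assuming a single CUDA GPU and the same model id: feed the model progressively longer prompts and watch the peak allocated memory climb.
Plain Text
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Repeat a filler sentence (roughly 10 tokens per repetition) to build prompts
# of increasing length, then look at the peak GPU memory after each generate call.
filler = "This is a filler sentence used only to pad the prompt. "
for target_tokens in (128, 1024, 4096):
    prompt = filler * (target_tokens // 10)
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    model.generate(input_ids, max_new_tokens=16, do_sample=False)
    # max_memory_allocated is a cumulative peak, so it only ever goes up as inputs get longer
    print(
        f"~{input_ids.shape[-1]} prompt tokens -> "
        f"{torch.cuda.max_memory_allocated() / 1e9:.1f} GB peak allocated"
    )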
Does that mean lowering the context_window from 16384 to something like 2048 would fix this? I tried, but it did not work.
It could. If that didn't fix it though, I suspect you might not have enough VRAM to run this model?
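If you do retry the lowering route, here is a sketch of what it looks like. Note that max_new_tokens also extends the sequence the KV cache has to hold, so it helps to shrink both; the exact numbers below are only illustrative.
Plain Text
llm = HuggingFaceLLM(
    context_window=2048,   # cap on how much LlamaIndex packs into one prompt
    max_new_tokens=256,    # generated tokens grow the KV cache too, so keep this modest
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    query_wrapper_prompt=query_wrapper_prompt,  # same prompt as in the original snippet
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
)
Settings.llm = llm
Settings.chunk_size = 512  # keep retrieved chunks well inside the smaller window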
I know I do have enough VRAM to run it, since this worked:
Plain Text
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format message with the command-r chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.3,
)

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
I just do not know how to wrap that inside LlamaIndex.
That works because it's a tiny prompt.

Try something bigger
Like a couple of paragraphs as a prompt at least
Something closer to 2000 tokens
The bigger the prompt, the more memory that gets used (as explained above)
Ah, I also tried using that model via Ollama and gave it whole essays, and it still ran fine.
Ollama is way more optimized than pure pytorch tbh
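If Ollama already handles it, you can also just point LlamaIndex at the Ollama server instead of loading the model in-process. A minimal sketch, assuming you have installed llama-index-llms-ollama and pulled a Command R build (e.g. `ollama pull command-r`):
Plain Text
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama

# Generation goes through the local Ollama server; give it a generous timeout for long prompts.
Settings.llm = Ollama(model="command-r", request_timeout=300.0)

# Embeddings are unchanged from the original setup (whatever Settings.embed_model resolves to).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context in 3 paragraphs?"))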