Hi guys, I am trying to use the new Command R 4-bit model with LlamaIndex. My machine runs the model just fine with plain transformers code from HF, but when I wrap it in LlamaIndex I get an OOM error.
This is my LlamaIndex code:
Plain Text
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
from llama_index.core import PromptTemplate
import torch

# This will wrap the default prompts that are internal to llama-index
query_wrapper_prompt = PromptTemplate("<BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>{query_str}<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>")

llm = HuggingFaceLLM(
    context_window=16384,
    max_new_tokens=4096,
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
    # tokenizer_kwargs={"max_length": 4096},
    # uncomment this if using CUDA to reduce memory usage
    # model_kwargs={"torch_dtype": torch.float16}
)

Settings.llm = llm
Settings.chunk_size = 1024
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context in 3 paragraphs? Return your response which covers the key points of the text and does not miss anything important, please."))

The error message:
Plain Text
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the
quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules
in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom `device_map` to
`from_pretrained`.
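(For reference, a rough sketch of what that error is asking for, wired through HuggingFaceLLM. This assumes a recent transformers where the flag is exposed as llm_int8_enable_fp32_cpu_offload on BitsAndBytesConfig, and that whether the override is accepted for an already-quantized checkpoint depends on your transformers version; the max_memory budgets below are placeholders you'd adjust to your hardware, and anything offloaded to CPU will run very slowly.)
Plain Text
from transformers import BitsAndBytesConfig
from llama_index.llms.huggingface import HuggingFaceLLM

# Let the non-quantized (fp32) modules live on the CPU when the GPU runs out of room.
offload_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

llm = HuggingFaceLLM(
    context_window=16384,
    max_new_tokens=4096,
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
    # other arguments as in the original snippet
    model_kwargs={
        "quantization_config": offload_config,
        # illustrative budgets only -- set these to a bit below your actual VRAM / RAM
        "max_memory": {0: "20GiB", "cpu": "48GiB"},
    },
)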
12 comments
When you use it outside of llama-index, I'm guessing you are just giving it small inputs?

LlamaIndex will give the model large inputs. With local PyTorch models like this, memory is allocated lazily: when the model sees an input bigger than anything it has seen before, it allocates new memory to handle it. This keeps happening until it sees the largest possible input (16384 tokens in this case).
I suspect you'd also get OOM in pure huggingface if you prompted the model with a large input
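A quick way to check that claim, as a sketch assuming a single CUDA GPU and the same model id: feed the model progressively longer prompts and watch the peak allocated memory climb.
Plain Text
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Repeat a filler sentence (roughly 10 tokens per repetition) to build prompts
# of increasing length, then look at the peak GPU memory after each generate call.
filler = "This is a filler sentence used only to pad the prompt. "
for target_tokens in (128, 1024, 4096):
    prompt = filler * (target_tokens // 10)
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    model.generate(input_ids, max_new_tokens=16, do_sample=False)
    # max_memory_allocated is a cumulative peak, so it only ever goes up as inputs get longer
    print(
        f"~{input_ids.shape[-1]} prompt tokens -> "
        f"{torch.cuda.max_memory_allocated() / 1e9:.1f} GB peak allocated"
    )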
Does that mean lowering the context_window from 16384 to something like 2048 would fix this? I tried, but it did not work.
It could. If that didn't fix it though, I suspect you might not have enough VRAM to run this model?
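If you do retry the lowering route, here is a sketch of what it looks like. Note that max_new_tokens also extends the sequence the KV cache has to hold, so it helps to shrink both; the exact numbers below are only illustrative.
Plain Text
llm = HuggingFaceLLM(
    context_window=2048,   # cap on how much LlamaIndex packs into one prompt
    max_new_tokens=256,    # generated tokens grow the KV cache too, so keep this modest
    generate_kwargs={"temperature": 0.7, "do_sample": True},
    query_wrapper_prompt=query_wrapper_prompt,  # same prompt as in the original snippet
    tokenizer_name="CohereForAI/c4ai-command-r-v01-4bit",
    model_name="CohereForAI/c4ai-command-r-v01-4bit",
    device_map="auto",
)
Settings.llm = llm
Settings.chunk_size = 512  # keep retrieved chunks well inside the smaller window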
I know I do have enough VRAM to run it, since this worked:
Plain Text
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/c4ai-command-r-v01-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Format message with the command-r chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
## <BOS_TOKEN><|START_OF_TURN_TOKEN|><|USER_TOKEN|>Hello, how are you?<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>

gen_tokens = model.generate(
    input_ids,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.3,
)

gen_text = tokenizer.decode(gen_tokens[0])
print(gen_text)
I just do not know how to wrap that inside LlamaIndex.
That works because it's a tiny prompt.

Try something bigger
Like a couple of paragraphs as a prompt at least
Something closer to 2000 tokens
The bigger the prompt, the more memory that gets used (as explained above)
Ah, I also tried using that model via Ollama and gave it whole essays, and it still ran fine.
Ollama is way more optimized than pure pytorch tbh
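If Ollama already handles it, you can also just point LlamaIndex at the Ollama server instead of loading the model in-process. A minimal sketch, assuming you have installed llama-index-llms-ollama and pulled a Command R build (e.g. `ollama pull command-r`):
Plain Text
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama

# Generation goes through the local Ollama server; give it a generous timeout for long prompts.
Settings.llm = Ollama(model="command-r", request_timeout=300.0)

# Embeddings are unchanged from the original setup (whatever Settings.embed_model resolves to).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
print(query_engine.query("Could you summarize the given context in 3 paragraphs?"))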