I've had to reduce `Settings.chunk_size`

At a glance

The community member had to halve Settings.chunk_size and reduce similarity_top_k in the query engine to get a setup working under v0.10 that had worked fairly well in v0.9. Other community members offered to try to reproduce the issue given a sample to run in v0.9.x vs. v0.10.x, and suggested checking the environment settings. The thread walks through the configuration: building vector embeddings, configuring prompts, and initializing the LLaMA-2 model, then discusses the memory usage and the CUDA out-of-memory error hit during inference. The consensus was that the issue is mostly specific to the HuggingFace models and libraries being used, and that reducing the context_window size has a much larger impact on the memory footprint than chunk_size or similarity_top_k.

I've had to reduce Settings.chunk_size by half, as well as similarity_top_k in the query engine, to get something working that worked fairly well in v0.9.
I haven't really noticed anything. But if you have a sample I can run in v0.9.x vs. v0.10.x, I'm happy to try and reproduce it
Hmm ... okay, maybe it's my environment settings. I'll check some things to see if it's reproducible, thanks!
Yea would really appreciate it! πŸ™ These types of things are hard to track down otherwise
Plain Text
# Build vector embeddings.
# Import paths below assume the v0.10-style package layout.

from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model = "sentence-transformers/all-mpnet-base-v2"

kwargs = {"device" : "cuda"}

embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_kwargs = kwargs, model_name = embed_model)
)
Plain Text
# Configure prompts.

from llama_index.core.prompts.prompts import SimpleInputPrompt  # import path assumes the v0.10 core package

query_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

sys_prompt = "You are a Q&A assistant. Your job is to answer questions as accurately as possible, based ONLY on the context drawn from document knowledge. Never provide answers not found in the context!"
Plain Text
# Configure quantization settings.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit = True)
Plain Text
# Initialize LLaMA-2 model.

import torch
from llama_index.llms.huggingface import HuggingFaceLLM  # assumes the v0.10 package layout

llama = HuggingFaceLLM(
    context_window       = 4096,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 4096},
)
Plain Text
# Initialize LlamaIndex global settings.

from llama_index.core import Settings

Settings.chunk_overlap = 20
Settings.chunk_size    = 512
Settings.embed_model   = embeddings
Settings.llm           = llama
Plain Text
# Initialize storage context.

from llama_index.core import StorageContext

storage = StorageContext.from_defaults(persist_dir = "./datasets/data")
Plain Text
# Load index from local storage.

from llama_index.core import load_index_from_storage

index = load_index_from_storage(show_progress = True, storage_context = storage)
Plain Text
query_engine = index.as_query_engine(similarity_top_k = 3, streaming = True)
At this point, the amount of VRAM consumed is 13958MiB / 23028MiB or around 14GB.
Inference operation:

Plain Text
answer = query_engine.query("Who are the Teenage Mutant Ninja Turtles?")

answer.print_response_stream()
This fails with the error message:

Plain Text
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 31.00 MiB is free. Process 10182 has 21.95 GiB memory in use. Of the allocated memory 18.37 GiB is allocated by PyTorch, and 3.26 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
Upon failure, VRAM consumption is 22484MiB / 23028MiB.
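As an aside, the allocator setting mentioned in the error text can be applied from Python before CUDA is initialized; a minimal sketch (the variable name and value come from the PyTorch message itself, everything else is illustrative):

Plain Text
# Sketch: enable PyTorch's expandable-segments allocator to reduce fragmentation,
# as suggested by the CUDA OOM message. Must be set before CUDA is initialized.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # the allocator reads the setting when CUDA is first used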
(So far, nothing here is specific to anything that changed in v0.10.x)

This seems mostly specific to huggingface + the models being used here

The default embed_batch_size is 10; lowering it will reduce embedding memory usage
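For example, a hedged sketch reusing the embedding setup from the snippet above, with a smaller batch size (4 is an arbitrary value; the default is 10):

Plain Text
# Sketch: same embedding wrapper as above, but with a reduced embed_batch_size.
embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(
        model_kwargs = {"device" : "cuda"},
        model_name   = "sentence-transformers/all-mpnet-base-v2",
    ),
    embed_batch_size = 4,
)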

For local LLMs, memory is allocated as the model reads longer and longer sequences (up to 4096 tokens in this case). So on load it allocates some initial memory, and that keeps growing as it sees longer inputs, until it has seen at least one input sequence of 4096 tokens
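A rough way to observe that growth, assuming the llama instance from the snippet above (the prompt lengths are arbitrary):

Plain Text
# Sketch: allocated VRAM should climb as the model sees progressively longer prompts.
import torch

for n_words in (100, 500, 2000):
    _ = llama.complete("word " * n_words)
    print(n_words, "words ->", torch.cuda.memory_allocated() // 2**20, "MiB allocated")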
Wonder why this was never a problem in v0.9 🤔
Okay, thanks bunches for your feedback!
Maybe new data? Or a change in langchain/transformers/bitsandbytes?
Well, reducing the context_window size seems to have a much more significant impact on the memory footprint than chunk_size and similarity_top_k
That too -- that will limit how much context the framework thinks it can send for a given llm call
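For reference, a hedged sketch of the same HuggingFaceLLM setup with a smaller context window (2048 is just an example value; bnb_config, query_prompt, and sys_prompt are reused from the snippets above):

Plain Text
# Sketch: identical to the earlier setup except for a reduced context window,
# which caps both KV-cache growth and how much retrieved context gets packed
# into a single LLM call.
llama = HuggingFaceLLM(
    context_window       = 2048,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 2048},
)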