
I've had to reduce `Settings.chunk_size`

I've had to reduce Settings.chunk_size by half, as well as similarity_top_k in the query engine, to get what worked fairly well in v0.9 working again.
I haven't really noticed anything. But if you have a sample I can run in v0.9.x vs. v0.10.x, I'm happy to try and reproduce it.
Hmm ... okay, maybe it's my environment settings. I'll check some things to see if it's reproducible, thanks!
Yea would really appreciate it! πŸ™ These types of things are hard to track down otherwise
Plain Text
# Build vector embeddings.

embed_model = "sentence-transformers/all-mpnet-base-v2"

kwargs = {"device" : "cuda"}

embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_kwargs = kwargs, model_name = embed_model)
)
Plain Text
# Configure prompts.

query_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

sys_prompt = "You are a Q&A assistant. Your job is to answer questions as accurately as possible, based ONLY on the context provided from the documents. Never provide answers not found in the context!"
Plain Text
# Configure quantization settings.

bnb_config = BitsAndBytesConfig(load_in_8bit = True)
Plain Text
# Initiate LLaMA-2 model.

llama = HuggingFaceLLM(
    context_window       = 4096,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 4096},
)
Plain Text
# Initialize LlamaIndex global settings.

Settings.chunk_overlap = 20
Settings.chunk_size    = 512
Settings.embed_model   = embeddings
Settings.llm           = llama
Plain Text
# Initialize storage context.

storage = StorageContext.from_defaults(persist_dir = "./datasets/data")
Plain Text
# Load index from local storage.

index = load_index_from_storage(show_progress = True, storage_context = storage)
Plain Text
query_engine = index.as_query_engine(similarity_top_k = 3, streaming = True)
At this point, the amount of VRAM consumed is 13958MiB / 23028MiB or around 14GB.
Inference operation:

Plain Text
answer = query_engine.query("Who are the Teenage Mutant Ninja Turtles?")

answer.print_response_stream()
This fails with the error message:

Plain Text
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 31.00 MiB is free. Process 10182 has 21.95 GiB memory in use. Of the allocated memory 18.37 GiB is allocated by PyTorch, and 3.26 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
Upon failure, VRAM consumption is 22484MiB / 23028MiB.
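The error message itself points at one mitigation for fragmentation. A minimal sketch, assuming the allocator setting is applied before the first CUDA allocation (i.e. before the model is loaded):

Plain Text
# Sketch: enable expandable segments, as suggested by the OOM message, to
# reduce allocator fragmentation. Must be set before CUDA is initialized.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# ... then import torch, build the embeddings/LLM, and query as above.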
(So far, nothing here is specific to anything that changed in v0.10.x)

This seems mostly specific to huggingface + the models being used here

The default embed_batch_size is 10; lowering it will reduce embedding memory usage.
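A sketch of that against the wrapper from the snippet above (this assumes LangchainEmbedding forwards embed_batch_size like other LlamaIndex embeddings; the value 2 is only an illustration):

Plain Text
# Sketch: reduce the embedding batch size (default 10) to lower peak VRAM
# during embedding, at the cost of slower indexing.
embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_kwargs = kwargs, model_name = embed_model),
    embed_batch_size = 2,
)

Settings.embed_model = embeddings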

For local LLMs, memory is allocated as the model processes longer and longer sequences (up to 4096 tokens in this case). So on load it allocates some initial memory, and that footprint keeps growing as it sees longer inputs, until it has processed at least one sequence of the full 4096 tokens.
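One way to watch that growth, as a sketch using standard PyTorch memory counters around the existing streaming query (generation only happens while the stream is consumed):

Plain Text
# Sketch: observe allocated VRAM before and after a query; the peak keeps
# climbing as longer input sequences are processed.
import torch

before = torch.cuda.memory_allocated()

answer = query_engine.query("Who are the Teenage Mutant Ninja Turtles?")
answer.print_response_stream()

print(f"allocated before: {before / 2**20:.0f} MiB")
print(f"allocated now:    {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
print(f"peak so far:      {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")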
Wonder why this was never a problem in v0.9 πŸ€”
Okay, thanks bunches for your feedback!
Maybe new data? Or a change in langchain/transformers/bitsandbytes?
Well, reducing the context_window size seems to have a much more significant impact on the memory footprint than chunk_size and similarity_top_k do.
That too. It limits how much context the framework thinks it can send for a given LLM call.
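A sketch of that change against the setup above (2048 is only an illustration; the window still has to fit the retrieved chunks plus the prompt and generated tokens):

Plain Text
# Sketch: cap the context window below the model's 4096-token maximum so less
# context is packed per call and attention/KV memory stays smaller.
llama = HuggingFaceLLM(
    context_window       = 2048,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 2048},
)

Settings.llm = llama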