I've had to reduce `Settings.chunk_size`

At a glance

The community member had to halve Settings.chunk_size and reduce similarity_top_k in the query engine to get a setup working under v0.10 that had worked fairly well in v0.9. Other community members offered to try to reproduce the issue given a sample to run in v0.9.x vs. v0.10.x, and suggested checking the environment settings. The thread walks through the configuration: building vector embeddings, configuring prompts, and initializing the LLaMA-2 model, then discusses the memory usage and the CUDA out-of-memory error hit during inference. The consensus was that the issue is mostly specific to the HuggingFace models and libraries being used, and that reducing the context_window size has a much larger impact on the memory footprint than chunk_size or similarity_top_k.

I've had to reduce Settings.chunk_size by half, as well as similarity_top_k in the query engine, to get something working that worked fairly well in v0.9.
I haven't really noticed anything. But if you have a sample I can run in v0.9.x vs. v0.10.x, I'm happy to try and reproduce it
Hmm ... okay, maybe it's my environment settings. I'll check some things to see if it's reproducible, thanks!
Yea would really appreciate it! πŸ™ These types of things are hard to track down otherwise
Plain Text
# Build vector embeddings.
# Import paths below assume the v0.10-style package layout.

from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.embeddings.langchain import LangchainEmbedding

embed_model = "sentence-transformers/all-mpnet-base-v2"

kwargs = {"device" : "cuda"}

embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(model_kwargs = kwargs, model_name = embed_model)
)
Plain Text
# Configure prompts.

from llama_index.core.prompts.prompts import SimpleInputPrompt  # import path assumes the v0.10 core package

query_prompt = SimpleInputPrompt("<|USER|>{query_str}<|ASSISTANT|>")

sys_prompt = "You are a Q&A assistant. Your job is to answer questions as accurately as possible, based ONLY on the context drawn from document knowledge. Never provide answers not found in the context!"
Plain Text
# Configure quantization settings.

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit = True)
Plain Text
# Initialize LLaMA-2 model.

import torch
from llama_index.llms.huggingface import HuggingFaceLLM  # assumes the v0.10 package layout

llama = HuggingFaceLLM(
    context_window       = 4096,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 4096},
)
Plain Text
# Initialize LlamaIndex global settings.

from llama_index.core import Settings

Settings.chunk_overlap = 20
Settings.chunk_size    = 512
Settings.embed_model   = embeddings
Settings.llm           = llama
Plain Text
# Initialize storage context.

from llama_index.core import StorageContext

storage = StorageContext.from_defaults(persist_dir = "./datasets/data")
Plain Text
# Load index from local storage.

from llama_index.core import load_index_from_storage

index = load_index_from_storage(show_progress = True, storage_context = storage)
Plain Text
query_engine = index.as_query_engine(similarity_top_k = 3, streaming = True)
At this point, the amount of VRAM consumed is 13958MiB / 23028MiB or around 14GB.
Inference operation:

Plain Text
answer = query_engine.query("Who are the Teenage Mutant Ninja Turtles?")

answer.print_response_stream()
This fails with the error message:

Plain Text
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 31.00 MiB is free. Process 10182 has 21.95 GiB memory in use. Of the allocated memory 18.37 GiB is allocated by PyTorch, and 3.26 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
Upon failure, VRAM consumption is 22484MiB / 23028MiB.
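As an aside, the allocator setting mentioned in the error text can be applied from Python before CUDA is initialized; a minimal sketch (the variable name and value come from the PyTorch message itself, everything else is illustrative):

Plain Text
# Sketch: enable PyTorch's expandable-segments allocator to reduce fragmentation,
# as suggested by the CUDA OOM message. Must be set before CUDA is initialized.
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # the allocator reads the setting when CUDA is first used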
(So far, nothing here is specific to anything that changed in v0.10.x)

This seems mostly specific to huggingface + the models being used here

The default embed_batch_size is 10; lowering it will reduce embedding memory usage
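For example, a hedged sketch reusing the embedding setup from the snippet above, with a smaller batch size (4 is an arbitrary value; the default is 10):

Plain Text
# Sketch: same embedding wrapper as above, but with a reduced embed_batch_size.
embeddings = LangchainEmbedding(
    HuggingFaceEmbeddings(
        model_kwargs = {"device" : "cuda"},
        model_name   = "sentence-transformers/all-mpnet-base-v2",
    ),
    embed_batch_size = 4,
)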

For local LLMs, memory is allocated as the model reads longer and longer sequences (up to 4096 tokens in this case). So on load it allocates some initial memory, and that keeps growing as it sees longer inputs, until it has seen at least one input sequence of 4096 tokens
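A rough way to observe that growth, assuming the llama instance from the snippet above (the prompt lengths are arbitrary):

Plain Text
# Sketch: allocated VRAM should climb as the model sees progressively longer prompts.
import torch

for n_words in (100, 500, 2000):
    _ = llama.complete("word " * n_words)
    print(n_words, "words ->", torch.cuda.memory_allocated() // 2**20, "MiB allocated")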
Wonder why this was never a problem in v0.9 🤔
Okay, thanks bunches for your feedback!
Maybe new data? Or a change in langchain/transformers/bitsandbytes?
Well, reducing the context_window size seems to have a much more significant impact on the memory footprint than chunk_size and similarity_top_k
That too -- that will limit how much context the framework thinks it can send for a given llm call
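For reference, a hedged sketch of the same HuggingFaceLLM setup with a smaller context window (2048 is just an example value; bnb_config, query_prompt, and sys_prompt are reused from the snippets above):

Plain Text
# Sketch: identical to the earlier setup except for a reduced context window,
# which caps both KV-cache growth and how much retrieved context gets packed
# into a single LLM call.
llama = HuggingFaceLLM(
    context_window       = 2048,
    device_map           = "auto",
    generate_kwargs      = {"do_sample" : True, "temperature" : 0.1},
    max_new_tokens       = 256,
    model_kwargs         = {"quantization_config" : bnb_config, "torch_dtype" : torch.float16},
    model_name           = "meta-llama/Llama-2-13b-chat-hf",
    query_wrapper_prompt = query_prompt,
    system_prompt        = sys_prompt,
    tokenizer_name       = "meta-llama/Llama-2-13b-chat-hf",
    tokenizer_kwargs     = {"max_length" : 2048},
)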