Memory

Hello, I have a very specific question:
I have a RAG system (Streamlit for the front end, a Flask API, and LlamaIndex, serving the LLM through the LlamaIndex Ollama integration). When I use the tool for some time (say 7 to 8 questions), my CUDA memory gets saturated pretty quickly. I was wondering what could be causing this: is it the context of my previous questions not being purged from memory after a response is generated, or is it something else?
I'm using chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True), and I call chat_engine.reset() to reset it. I also tried torch.cuda.empty_cache() to empty the cache, and neither helps (a minimal sketch of the setup is below).

LLM: Llama 3 70B Instruct (4-bit quant)
Embeddings: BAAI/bge-base-en-v1.5 (using the HuggingFace embeddings integration with LlamaIndex)
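
For reference, a minimal sketch of the setup described above, assuming the llama-index-llms-ollama and llama-index-embeddings-huggingface integrations; the Ollama model tag and the ./data corpus are placeholders, not taken from the thread:

import torch
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Assumed model tag for the 4-bit Llama 3 70B Instruct served by Ollama.
Settings.llm = Ollama(model="llama3:70b-instruct-q4_0", request_timeout=120.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

documents = SimpleDirectoryReader("./data").load_data()  # placeholder corpus
index = VectorStoreIndex.from_documents(documents)

chat_engine = index.as_chat_engine(chat_mode="condense_question", verbose=True)
response = chat_engine.chat("What does the document say about X?")

# What was already tried after each answer:
chat_engine.reset()        # clears the chat history held by the engine
torch.cuda.empty_cache()   # only frees PyTorch's cache in this process; the Ollama server is separate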
Attachments
Screenshot_from_2024-06-25_09-24-57.png
Screenshot_from_2024-06-25_09-24-44.png
Screenshot_from_2024-06-25_09-25-20.png
Screenshot_from_2024-06-25_09-25-53.png
12 comments
It could be a bug with ollama?

You could also try lowering the batch size on your embedding model

HuggingFaceEmbedding(..., embed_batch_size=2)
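
A slightly fuller sketch of that suggestion; smaller batches lower the peak GPU memory used by the in-process embedding model (the model name is the one from the question, and Settings is the llama_index.core global config):

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Smaller embed batches = lower peak GPU memory while embedding documents and queries.
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-base-en-v1.5",
    embed_batch_size=2,
)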
Actually, I think I know.
Llama 3 70B is a huge model, and typically memory is allocated as the model sees longer input sequences.

So if the first input to the LLM is 100 tokens, it allocates memory for those 100 tokens.

If the next LLM input is 200 tokens, it allocates memory for an additional 100 tokens.

If the third request is 100 tokens, no new memory is allocated.

Basically, this keeps happening until the LLM reaches its max context size.
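
As a rough illustration of what that growth costs, here is a back-of-the-envelope KV-cache estimate; the architecture figures (80 layers, 8 grouped-query KV heads, head dim 128) are the published Llama 3 70B numbers, and an fp16 key/value cache is assumed:

# Approximate KV-cache cost per token for Llama 3 70B (assumed fp16 cache).
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
print(kv_bytes_per_token / 1024)            # ~320 KiB per token
print(2048 * kv_bytes_per_token / 1024**2)  # ~640 MiB once a 2k context is reached

That cache sits on top of roughly 35-40 GB of 4-bit weights, so a GPU that is already close to full can tip over as the cache grows toward the max context.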
The solution here might be to lower the token limit of the memory buffer.
So based on this example, the LLM will keep allocating more and more memory with each request. Is there a way to control that? I'm sure this is outside the scope of LlamaIndex, but I could really use some insight on it.
I'm not really sure what exactly this means. Do you mean limiting the number of tokens in the requests the LLM gets, so that it allocates less memory per request? Wouldn't that just delay the saturation of the memory?
It allocates until at least one input passes through that reaches the max context of the LLM. This is true of any LLM I've ever used.
The chat engine here has a memory buffer; you can pass it in manually and set a token_limit.
I looked a bit into the documentation and found out that this is the way to define memory and manage the history:
Plain Text
from llama_index.core.memory import ChatMemoryBuffer

# Cap the chat history at 1500 tokens so it is truncated instead of growing unbounded.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="condense_plus_context",
    memory=memory,
    context_prompt=(
        "You are a chatbot, able to have normal interactions. "
        "Here are the relevant documents for the context:\n"
        "{context_str}"
        "\nInstruction: Based on the above documents, provide a detailed answer for the user question below."
    ),
    verbose=True,
)

However, it still saturates the memory after I tried this (saturated after about 7 to 9 questions). The context length of the LLM is 2k (I used a Modelfile to change it for Llama 3 70B on Ollama; see the sketch after this message for setting it from the LlamaIndex side instead).
I was wondering if there's a way to have no history at all, since what I'm building is not a chatbot that needs conversation history; it's only a QA system, where the only context needed is the text retrieved from the database.
So the flow, I think, would be simpler:
question -> retrieve relevant chunks -> combine and send to the LLM -> get the response (then save it to an SQL DB) -> delete the retrieved chunks and the response from memory
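
On the 2k context mentioned above: this is only a sketch, assuming the llama-index-llms-ollama integration and a placeholder model tag. The integration exposes a context_window parameter that caps the prompt LlamaIndex builds and is intended to be forwarded to Ollama as num_ctx, though exact behavior may vary by version:

from llama_index.llms.ollama import Ollama

# Assumed model tag; context_window caps the prompt size LlamaIndex builds and is
# intended to be passed to Ollama as num_ctx (check your installed version).
llm = Ollama(
    model="llama3:70b-instruct-q4_0",
    context_window=2048,
    request_timeout=120.0,
)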
@Logan M can you please share some insights about this if you have any? I can't really find anything elsewhere that addresses this issue.
If you don't need chat history, it's better to use index.as_query_engine()
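
A minimal sketch of that stateless setup; the question string and similarity_top_k value are illustrative. Each .query() call retrieves chunks, sends a single prompt to the LLM, and keeps no conversation history:

# Stateless QA: no chat memory, every query stands on its own.
query_engine = index.as_query_engine(similarity_top_k=3)  # illustrative top-k

response = query_engine.query("What does the report say about X?")
print(response.response)
# Save response.response to the SQL DB here; nothing is retained between queries,
# so no chat history accumulates on the LlamaIndex side.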