Hi there. We have been using LlamaIndex to develop a RAG pipeline, and recently realized that our system slows down over time, which can be resolved by a system reboot. GPU logs show signs of a memory leak. I wonder if there are common mistakes people make that cause this kind of issue? What would be a good way to solve it? Any advice would be much appreciated!
It could be one of these reasons:
  • Are you indexing a large amount of data and keeping it all in memory?
  • Are you using a local LLM?
Keeping a large index in local memory can cause slowness, and the same goes for the LLM.

If it is the first case, try using a vector store like Chroma/Pinecone/Weaviate etc. (see the sketch below).

For the second, I guess you'll have to buy some more gigs 😅
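For example, a minimal sketch of what moving the index into Chroma could look like (the paths and collection name are placeholders, and the imports assume the pre-0.9 llama_index layout):

```python
import chromadb
from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores import ChromaVectorStore

# Persistent Chroma client so the embeddings live on disk, not in process memory
chroma_client = chromadb.PersistentClient(path="./chroma_db")        # placeholder path
collection = chroma_client.get_or_create_collection("rag_docs")      # placeholder name

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()              # placeholder data dir
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

query_engine = index.as_query_engine()
```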
If GPU memory is going up, it sounds like there should be a torch cuda cache clear at some point to clean up the GPU memory? Also curious what LLM/embeddings you are using
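A minimal sketch of that kind of cleanup between requests (note this only helps once the stale tensors are no longer referenced anywhere in Python):

```python
import gc

import torch

def free_gpu_memory():
    # Drop unreachable Python objects first so their tensors become collectable
    gc.collect()
    # Return cached, unreferenced CUDA blocks to the driver
    torch.cuda.empty_cache()
```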
@WhiteFang_Jr @Logan M
Thanks for the quick replies. As for clearing the CUDA cache, I will try torch.cuda.empty_cache().

Apologies for not providing the details of our implementations.

llama-index: 0.8.61
LLM: Llama-2-13b-chat
Embedding model: none
Vector Store: none
KnowledgeGraphIndex: neo4j 5.14.1
CUDA version: 11.4
GPU: T4 15GB VRAM x 4 (60GB VRAM in total)

Would there be anything we can look into to solve the memory leakage issue?
What LLM class were you using to load the LLM, out of curiosity?
HuggingFaceLLM
@Logan M
Does this class try to connect outside of a closed environment? Our system has to run in a closed environment, but when we deploy it there, it won't run due to the lack of an Internet connection.
For whatever reason AutoModel.from_pretrained() tries to ping the internet to refresh its cache.

But if you download the model and tokenizer to a folder and provide the folder path, it should be fine.
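A minimal sketch of loading from a local folder with HuggingFaceLLM (the folder path and generation settings are placeholders; exact parameter names may vary slightly between llama-index versions):

```python
import os

from llama_index.llms import HuggingFaceLLM

# Extra safety for air-gapped deployments: never let transformers hit the network
os.environ["TRANSFORMERS_OFFLINE"] = "1"

LOCAL_MODEL_DIR = "/models/llama-2-13b-chat"  # placeholder path to the downloaded weights

llm = HuggingFaceLLM(
    model_name=LOCAL_MODEL_DIR,      # local folder instead of a Hub repo id
    tokenizer_name=LOCAL_MODEL_DIR,
    context_window=4096,
    max_new_tokens=256,
    device_map="auto",
)
```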
Tbh for production deployments (and for GPU resources that large), you might also want to look into using something like vLLM
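For reference, a rough vLLM sketch (the local model path and tensor_parallel_size=4 for the 4x T4 setup are assumptions, not details from this thread):

```python
from vllm import LLM, SamplingParams

# Shard the locally downloaded Llama-2-13b-chat weights across the 4 GPUs
llm = LLM(model="/models/llama-2-13b-chat", tensor_parallel_size=4)  # placeholder path

sampling = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is retrieval-augmented generation?"], sampling)
print(outputs[0].outputs[0].text)
```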
@Logan M We found that there was one part where the system was trying to load the base model from cache, instead of the local folder. The issue is resolved, thanks so much!

Thanks also for your recommendation of vLLM. We will look into it!