GPU

I am trying to check why my GPU memory utilisation is so high when ingesting documents into the Qdrant vector store.
Despite calling .persist(), it appears that the full set of embeddings is being auto-loaded into memory.

Is there a way to unload/offload the vector store (indices) from & to GPU memory?
Vectors are not stored on gpu...

What kind of ingestion setup do you have? Are you using a local embedding model?
Yes, I'm using a local embedding model.
The setup (privateGPT) uses BaseIndex, I believe, to initialise and manage an index.

Ah, I figured the index loads into video memory as well, since its utilisation increases by a considerable amount even after the ingestion is done
Yea, so the only things using GPU memory here would be the embedding model and the LLM, since they are running locally

If you aren't using a vector store integration (qdrant, weaviate, chroma, etc.), then the embeddings are held in regular RAM (but not VRAM)
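For reference, a minimal sketch of what a vector store integration looks like in LlamaIndex, so embeddings land in Qdrant rather than the default in-memory store (import paths vary by llama-index version; the collection name and data folder here are just placeholders):

Plain Text
# Sketch: route embeddings into a local on-disk Qdrant instead of the default in-memory store.
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = QdrantClient(path="local_data/private_gpt/qdrant")  # local mode, no Qdrant server needed
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)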
I believe by default PrivateGPT uses Qdrant... Is there a quick way to check whether it's being used?
I just added some logging and it does seem to be initialising with these props:
location=None, url=None, port=6333, grpc_port=6334, prefer_grpc=False, https=None, api_key=None, prefix=None, timeout=None, host=None, path='local_data/private_gpt/qdrant', force_disable_check_same_thread=True
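Another quick way to check, if you can get at the index object itself (a sketch; assumes a LlamaIndex-style BaseIndex is reachable as a hypothetical `index` variable, and attribute names may differ across versions):

Plain Text
# Sketch: print the concrete vector store class behind an index.
print(type(index.storage_context.vector_store).__name__)  # e.g. "QdrantVectorStore"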
Also, any info about where GPU memory is used in the RAG stack would be appreciated.
Sorry I'm far from an expert at this - really appreciate the help
Yea no worries.

So during ingestion, the embedding model will embed your data. The gpu memory usage is mainly controlled by the batch size of the embedding model (the default is 10)
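If VRAM during ingestion is the concern, that batch size can usually be lowered when constructing the embedding model. A sketch assuming a HuggingFace embedding wrapper (the model name is just an example; the exact import path depends on your llama-index version):

Plain Text
# Sketch: smaller embed batches -> smaller peak GPU memory during ingestion.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # example model, swap for whatever you run locally
    embed_batch_size=4,                   # default is 10; lower trades speed for VRAM
    device="cuda",
)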

During querying, the embedding model also embeds the query (using your gpu), and then similar nodes are retrieved.

Then, the LLM takes the nodes and the query, and returns a response (using gpu memory).
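If you want to see where the VRAM actually goes across those steps, torch exposes simple counters you can print around each stage (a rough sketch; this only accounts for memory torch itself allocates):

Plain Text
# Sketch: report how much GPU memory torch has allocated/reserved at a given point.
import torch

def report(stage: str) -> None:
    alloc = torch.cuda.memory_allocated() / 1024**2    # MB currently held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**2  # MB held by the caching allocator
    print(f"{stage}: allocated={alloc:.0f} MB, reserved={reserved:.0f} MB")

report("after loading embedding model")
# ... embed / retrieve / run the LLM ...
report("after query")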
While using an ingestion pipeline that ingests into a qdrant vector store, I'm having problems with GPU VRAM. The vectors are stored in qdrant, but the GPU memory does not flush until the Python process is killed
Are you using a local embedding model? That's pretty classic for torch/huggingface
You can try

Plain Text
import torch
# frees blocks cached by the CUDA allocator; tensors still referenced stay on the GPU
torch.cuda.empty_cache()
But tbh I always have trouble clearing memory for local torch models
I'm using a huggingface model
yea, huggingface uses torch by default
The above may help
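A fuller cleanup sequence that sometimes works better than empty_cache() alone (a sketch; assumes the model object is still reachable, here as a hypothetical embed_model variable):

Plain Text
# Sketch: drop all Python references to the model, run the garbage collector,
# then ask torch to release its cached blocks.
import gc
import torch

del embed_model           # hypothetical variable holding the huggingface/torch model
gc.collect()              # collect the now-unreferenced model object
torch.cuda.empty_cache()  # release cached (unused) blocks of GPU memory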