GPU

I am trying to check why my GPU memory utilisation is so high when ingesting documents into the Qdrant vector store.
Despite calling .persist(), it appears that the full set of embeddings is being auto-loaded into memory.

Is there a way to unload/offload the vector store (indices) from & to GPU memory?
Vectors are not stored on gpu...

What kind of ingestion setup do you have? Are you using a local embedding model?
Yes, I'm using a local embedding model.
The setup (privateGPT) uses BaseIndex, I believe, to initialise and manage an index.

Ah, I figured the index loads into video memory as well, since its utilisation increases by a considerable amount even after the ingestion is done
Yea, so the only things using GPU memory here would be the embedding model and the LLM, since they are running locally

If you aren't using a vector store integration (qdrant, weaviate, chroma, etc.), then the embeddings are held in regular RAM (but not VRAM)
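For reference, a minimal sketch of what a vector store integration looks like in LlamaIndex, so embeddings land in Qdrant rather than the default in-memory store (import paths vary by llama-index version; the collection name and data folder here are just placeholders):

Plain Text
# Sketch: route embeddings into a local on-disk Qdrant instead of the default in-memory store.
from qdrant_client import QdrantClient
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = QdrantClient(path="local_data/private_gpt/qdrant")  # local mode, no Qdrant server needed
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)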
I believe by default PrivateGPT uses Qdrant... Is there a quick way to check whether it's being used?
I just added some logging and it does seem to be initialising with these props:
location=None, url=None, port=6333, grpc_port=6334, prefer_grpc=False, https=None, api_key=None, prefix=None, timeout=None, host=None, path='local_data/private_gpt/qdrant', force_disable_check_same_thread=True
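Another quick way to check, if you can get at the index object itself (a sketch; assumes a LlamaIndex-style BaseIndex is reachable as a hypothetical `index` variable, and attribute names may differ across versions):

Plain Text
# Sketch: print the concrete vector store class behind an index.
print(type(index.storage_context.vector_store).__name__)  # e.g. "QdrantVectorStore"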
Also, any info about where GPU memory is used in the RAG stack would be appreciated.
Sorry I'm far from an expert at this - really appreciate the help
Yea no worries.

So during ingestion, the embedding model will embed your data. The gpu memory usage is mainly controlled by the batch size of the embedding model (the default is 10)
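If VRAM during ingestion is the concern, that batch size can usually be lowered when constructing the embedding model. A sketch assuming a HuggingFace embedding wrapper (the model name is just an example; the exact import path depends on your llama-index version):

Plain Text
# Sketch: smaller embed batches -> smaller peak GPU memory during ingestion.
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # example model, swap for whatever you run locally
    embed_batch_size=4,                   # default is 10; lower trades speed for VRAM
    device="cuda",
)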

During querying, the embedding model also embeds the query (using your gpu), and then similar nodes are retrieved.

Then, the LLM takes the nodes and the query, and returns a response (using gpu memory).
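If you want to see where the VRAM actually goes across those steps, torch exposes simple counters you can print around each stage (a rough sketch; this only accounts for memory torch itself allocates):

Plain Text
# Sketch: report how much GPU memory torch has allocated/reserved at a given point.
import torch

def report(stage: str) -> None:
    alloc = torch.cuda.memory_allocated() / 1024**2    # MB currently held by live tensors
    reserved = torch.cuda.memory_reserved() / 1024**2  # MB held by the caching allocator
    print(f"{stage}: allocated={alloc:.0f} MB, reserved={reserved:.0f} MB")

report("after loading embedding model")
# ... embed / retrieve / run the LLM ...
report("after query")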
While using an ingestion pipeline that ingests into a qdrant vector store, I'm having problems with GPU VRAM. The vectors are stored in qdrant, but the GPU memory does not flush until the Python process is killed
Are you using a local embedding model? That's pretty classic for torch/huggingface
You can try

Plain Text
import torch
# frees blocks cached by the CUDA allocator; tensors still referenced stay on the GPU
torch.cuda.empty_cache()
But tbh I always have trouble clearing memory for local torch models
I'm using a huggingface model
yea, huggingface uses torch by default
The above may help
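A fuller cleanup sequence that sometimes works better than empty_cache() alone (a sketch; assumes the model object is still reachable, here as a hypothetical embed_model variable):

Plain Text
# Sketch: drop all Python references to the model, run the garbage collector,
# then ask torch to release its cached blocks.
import gc
import torch

del embed_model           # hypothetical variable holding the huggingface/torch model
gc.collect()              # collect the now-unreferenced model object
torch.cuda.empty_cache()  # release cached (unused) blocks of GPU memory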