While using an ingestion pipeline that ingests into a Qdrant vector store, I'm having problems with GPU VRAM. The vectors are stored in Qdrant, but the GPU memory is not released until the Python process is killed.
a) You could lower the batch size. b) You could lower the max length (or chunk size), since E5 has a rather large maximum input size. Memory for the model is allocated lazily on the fly (i.e. an input sequence of 8 tokens will only allocate memory for those 8 tokens; if the next sequence is 16 tokens, memory for an additional 8 tokens is allocated and kept for the rest of the process). A rough sketch of both knobs is below.
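A minimal sketch of the two suggestions, assuming the pipeline embeds with sentence-transformers and an E5 checkpoint; the model name, batch size, and max length below are illustrative, not taken from the original pipeline:

```python
# Sketch: cap peak GPU memory during ingestion by limiting batch size and
# input length. Model name and numbers are illustrative assumptions.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large", device="cuda")

# (b) Cap the max input length. The allocator grows to the longest sequence
# it has seen, so a smaller cap keeps the cached allocation small.
model.max_seq_length = 256  # tokens; tune to your chunk size

# E5 expects the "passage:" prefix for documents being indexed.
texts = ["passage: example document chunk"] * 1000

# (a) Lower the batch size so the peak allocation per forward pass stays small.
embeddings = model.encode(texts, batch_size=16, show_progress_bar=True)
```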
I could see E5 using 80 GB with both a large batch size and large inputs.