Updated 2 months ago

I'm using `reranker = SentenceTransformerRerank(top_n=5, model="BAAI/bge-reranker-base")` as a global variable in my chat engine.

Now when I try to deploy my app via Render, it fails: Ran out of memory (used over 2GB) while running your code.

What are some best practices for storing / deploying models?
11 comments
This runs a reranking model locally.

Ideally, this runs on a separate server or behind an API (e.g. Cohere).

This reminds me, I should add support for reranking with the text-embeddings-inference package.
Thanks for writing back @Logan M! Any recs moving forward? Notebooks? Should I just increase memory?

Sorry, new to this area!
You should just increase memory for now (or use Cohere reranking).

If I implement support for that text-embeddings-inference package above, you could also deploy your own server to run a reranking endpoint.
I'll be the first to try it!
For posterity, increasing the instance to Pro Plus 4 CPU 8 GB solved this issue.
[Attachment: image.png]
Although now it's extremely slow.
[Attachment: image.png]
Same question, run locally on my 12-CPU, 32 GB MacBook Pro (which is still very slow):
[Attachment: image.png]
I'm guessing the CPUs aren't the highest quality.

At the end of the day, you are running a 1GB model on CPU. I see top_n=5, so I'm assuming the initial top k is even larger, which means it has to call this model several times (which, without CUDA, will be slow).
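To make that cost concrete, here is a minimal sketch of what a reranking step does. The `score_pair` callable and `toy_score` are hypothetical stand-ins for one forward pass of the reranker model and are not part of any library. The point is that every retrieved candidate gets scored against the query before the best `top_n` are kept, so the work grows with the initial top k, not with `top_n`.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> Tuple[List[str], int]:
    """Score every (query, candidate) pair and keep the top_n best.

    Returns the reranked texts and the number of pairs that were scored.
    """
    scored = [(score_pair(query, text), text) for text in candidates]
    # Highest score first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]], len(scored)


def toy_score(query: str, text: str) -> float:
    # Hypothetical stand-in for the model: counts words shared with the query.
    return float(len(set(query.lower().split()) & set(text.lower().split())))


query = "why is reranker slow on cpu"
candidates = [
    "cooking pasta at home",
    "the reranker is slow on cpu without cuda",
    "memory limits on small instances",
    "cpu inference for a reranker",
]
top, pairs_scored = rerank(query, candidates, toy_score, top_n=2)
```

Here only two results are returned, but all four candidates had to be scored. With a real 1GB model on CPU, each of those scorings is a full forward pass.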
Gotcha. Going to see how simple reciprocal rank fusion compares for now.
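For reference, reciprocal rank fusion needs no model at all: it merges several ranked lists using only each document's position in each list. A minimal sketch, assuming the commonly used constant `k = 60` and made-up document IDs (`vector_hits` and `keyword_hits` are hypothetical retriever outputs):

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse ranked lists: each document earns 1 / (k + rank) per list it appears in."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Documents with the highest fused score come first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical results from one retriever
keyword_hits = ["b", "d", "a"]  # hypothetical results from another
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that rank well in both lists ("a" and "b" here) rise to the top, and the whole thing is a few dictionary updates, so it adds essentially no memory or latency.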
@Logan M -- is there a straightforward way to use Hugging Face hosted models with SentenceTransformerRerank?
Hmm, not really.

The easiest way would be to subclass and make your own node postprocessor that does what you need.

It's not too bad, just a single class method.
https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/root.html#id2
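A rough, library-free sketch of what such a postprocessor could look like. Everything here is an assumption for illustration: `ScoredText` stands in for a retrieved node with a score, and `score_pairs` is a hypothetical callable that would wrap a request to a hosted model and return one score per text. In LlamaIndex itself, the equivalent logic would live in the postprocessing method of your subclass, as described in the docs linked above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ScoredText:
    """Hypothetical stand-in for a retrieved node and its score."""
    text: str
    score: float = 0.0


class RemoteRerank:
    """Sketch of a postprocessor that delegates scoring to a hosted model."""

    def __init__(
        self,
        score_pairs: Callable[[str, Sequence[str]], List[float]],
        top_n: int = 5,
    ) -> None:
        # score_pairs is an assumption: it would wrap one request to the
        # hosted model and return one relevance score per input text.
        self.score_pairs = score_pairs
        self.top_n = top_n

    def postprocess_nodes(self, nodes: List[ScoredText], query: str) -> List[ScoredText]:
        if not nodes:
            return []
        scores = self.score_pairs(query, [node.text for node in nodes])
        if len(scores) != len(nodes):
            raise ValueError("expected exactly one score per node")
        reranked = [ScoredText(node.text, s) for node, s in zip(nodes, scores)]
        reranked.sort(key=lambda node: node.score, reverse=True)
        return reranked[: self.top_n]


def fake_remote_scores(query: str, texts: Sequence[str]) -> List[float]:
    # Offline stand-in for the remote call: counts words shared with the query.
    query_words = set(query.lower().split())
    return [float(len(query_words & set(t.lower().split()))) for t in texts]


nodes = [
    ScoredText("render deploy logs"),
    ScoredText("reranker on cpu"),
    ScoredText("cpu reranker memory use"),
]
result = RemoteRerank(fake_remote_scores, top_n=2).postprocess_nodes(
    nodes, "reranker cpu memory"
)
```

Because the scoring function is injected, the app process never loads the model itself, which is what avoids the memory problem from the original question.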