I'm using `reranker = SentenceTransformerRerank(top_n=5, model="BAAI/bge-reranker-base")` as a global variable in my chat engine.
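Since the failure happens when the ~1 GB model is constructed as a module-level global, one generic mitigation is to lazy-initialize the reranker on first use so imports stay cheap (this does not reduce the model's total footprint, only defers it past startup). This is a sketch of the pattern, not llama_index-specific; the dict below is a placeholder standing in for the real constructor:

```python
# Sketch: lazy-initialize a heavy model instead of loading it at import time.
# Generic pattern; the body of get_reranker() is a placeholder for whatever
# constructor you actually use (e.g. SentenceTransformerRerank).

from functools import lru_cache

@lru_cache(maxsize=1)
def get_reranker():
    """Build the reranker once, on first use, and reuse it afterwards."""
    # In the real app this would be something like:
    #   from llama_index.core.postprocessor import SentenceTransformerRerank
    #   return SentenceTransformerRerank(top_n=5, model="BAAI/bge-reranker-base")
    return {"top_n": 5, "model": "BAAI/bge-reranker-base"}  # placeholder object

# Nothing heavy is loaded until the first call, and later calls reuse it:
reranker = get_reranker()
assert reranker is get_reranker()  # cached: only one copy in memory
```

`lru_cache(maxsize=1)` gives a thread-safe one-shot singleton without any global mutable state.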

Now when I try to deploy my app via Render, it fails: Ran out of memory (used over 2GB) while running your code.

What are some best practices for storing / deploying models?
this runs an embedding model locally

Ideally, this runs on a separate server or behind an API (e.g. Cohere)

This reminds me, I should add support for reranking with text-embedding-interface package
Thanks for writing back @Logan M! Any recs moving forward? Notebooks? Should I just increase memory?

Sorry, new to this area!
should just increase memory for now (or use cohere reranking)

If I implement that text-embedding-interface package above, you could also deploy your own server to run a reranking endpoint
I'll be the first to try it!
For posterity, increasing the instance to Pro Plus 4 CPU 8 GB solved this issue.
Although now it's extremely slow.
Same question when run locally on my 12-CPU / 32 GB MBP, which is still very slow.
I'm guessing the CPUs aren't the highest quality

At the end of the day, you are running a 1 GB model on CPU. I see top_n=5, so I'm assuming the initial top-k is even larger, which means it has to call this model several times (and without CUDA, that will be slow)
Gotcha. Going to see how simple reciprocal rank fusion (RRF) compares for now
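Reciprocal rank fusion needs no model at all, which makes it essentially free compared to a cross-encoder reranker. A minimal self-contained sketch (illustrative only; k=60 is the constant commonly used for RRF):

```python
# Minimal reciprocal rank fusion (RRF): merge several ranked lists of document
# ids without running any model. Each document scores the sum, over the lists
# it appears in, of 1 / (k + rank); higher is better.

from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "a" is ranked first by both retrievers, so it wins after fusion:
fused = rrf([["a", "b", "c"], ["a", "c", "d"]])
# → ["a", "c", "b", "d"]
```

Because it only looks at ranks, RRF can fuse results from retrievers whose raw scores aren't comparable (e.g. BM25 and dense retrieval).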
@Logan M -- is there a straightforward way to use Hugging Face hosted models with SentenceTransformerRerank?
hmm not really.

Easiest way would be to subclass and make your own node-postprocessor that does what you need

It's not too bad -- a single class method
https://docs.llamaindex.ai/en/stable/module_guides/querying/node_postprocessors/root.html#id2
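To make the subclassing idea concrete, here is a framework-free sketch of a postprocessor that calls a remote reranking endpoint instead of running the model in-process. The URL and the `{"query", "texts"} → {"scores"}` request/response shape are assumptions for illustration (a real server, e.g. a TEI-style deployment, may differ); in LlamaIndex you would wrap this logic in your own node-postprocessor subclass per the docs linked above. The pure reordering step is factored out so it works with any score source:

```python
# Sketch: rerank via a remote HTTP endpoint rather than a local model.
# The endpoint URL and JSON shape below are hypothetical placeholders.

import json
import urllib.request

def apply_scores(texts, scores, top_n):
    """Keep the top_n texts by descending score (pure, framework-free)."""
    ranked = sorted(zip(texts, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]

def rerank_remote(query, texts, top_n=5, url="http://localhost:8080/rerank"):
    # Hypothetical request/response shape; adapt to your actual server.
    payload = json.dumps({"query": query, "texts": texts}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        scores = json.load(resp)["scores"]  # assumed: one float per text
    return apply_scores(texts, scores, top_n)
```

Keeping `apply_scores` separate means the reordering logic is trivially testable without a live endpoint.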