I am using local models for a Q&A RAG

At a glance

The community member is using local models for a Q&A RAG pipeline and is trying to use multiprocessing to optimize resource usage. However, they are facing an issue where they have to load models for each process separately, which slows things down. They tried using a multiprocessing queue to share the service_context among processes, but encountered an error: "cannot pickle 'builtins.CoreBPE' object".

The comments suggest that the solution is to host the model separately and send requests to it through a queue for processing. Community members recommend using a model server such as vLLM or Hugging Face's text-generation-inference, and suggest using LangChain's LLM wrapper for text-generation-inference.

I am using local models for a Q&A RAG pipeline. I am trying to use multiprocessing to get the best use of my resources. I can make it work, but I have to load the models separately in each process, which slows things down. I tried using a multiprocessing queue to share the service_context among processes but got this error:
cannot pickle 'builtins.CoreBPE' object

Any advice?
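
For context, this error typically comes from the tokenizer held inside the service_context: tiktoken encodings wrap a Rust-level CoreBPE object that cannot be pickled, and multiprocessing queues pickle everything placed on them. A minimal sketch of the failure, assuming tiktoken is what backs the tokenizer:

```python
import pickle
import tiktoken

# tiktoken encodings hold a Rust CoreBPE object internally
enc = tiktoken.get_encoding("cl100k_base")

try:
    # multiprocessing.Queue serializes items with pickle, so this is
    # effectively what happens when a service_context containing this
    # tokenizer is put on the queue
    pickle.dumps(enc)
except TypeError as err:
    print(err)  # cannot pickle 'builtins.CoreBPE' object
```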
5 comments
Local models have the limitation that you can't share them across processes like that

The solution is probably to host the model separately and send requests to it via a queue for processing (rough sketch below)
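
A hedged sketch of that pattern using only the standard library: one dedicated process loads the model once and answers requests from a multiprocessing queue, so workers only exchange plain strings, which pickle fine. FakeLocalLLM and load_model are hypothetical stand-ins for whatever local model is actually being run:

```python
import multiprocessing as mp


class FakeLocalLLM:
    """Stand-in for a real local model; replace with your actual model."""
    def generate(self, prompt: str) -> str:
        return f"answer to: {prompt}"


def load_model() -> FakeLocalLLM:
    # Hypothetical: in practice this would load your local model / service_context.
    return FakeLocalLLM()


def model_server(request_q: mp.Queue, response_q: mp.Queue) -> None:
    # This single process owns the model, so nothing model-related is ever pickled.
    llm = load_model()
    while True:
        item = request_q.get()
        if item is None:          # sentinel: shut the server down
            break
        req_id, prompt = item
        response_q.put((req_id, llm.generate(prompt)))


def worker(req_id: int, prompt: str, request_q: mp.Queue) -> None:
    # Workers only put plain Python objects on the queue, which pickle fine.
    request_q.put((req_id, prompt))


if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    server = mp.Process(target=model_server, args=(requests, responses))
    server.start()

    prompts = [f"question {i}" for i in range(4)]
    workers = [mp.Process(target=worker, args=(i, p, requests))
               for i, p in enumerate(prompts)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

    for _ in prompts:             # collect one answer per request
        req_id, answer = responses.get()
        print(req_id, answer)

    requests.put(None)            # stop the model server
    server.join()
```

In practice the whole query-engine call would live in the server process, since the service_context itself is the part that cannot cross process boundaries.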
That is something I have not done before. Can you direct me to documentation for the proposed solution, or libraries I can use?
Need to use a model server like vLLM or Hugging Face's text-generation-inference
We have a vLLM integration. For text-generation-inference, you can use LangChain's LLM (sketch below)
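
A rough sketch of the text-generation-inference route, assuming a LlamaIndex 0.9-era API (ServiceContext, LangChainLLM) and LangChain's HuggingFaceTextGenInference wrapper; the server URL, data directory, query, and generation settings are placeholders:

```python
from langchain.llms import HuggingFaceTextGenInference
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LangChainLLM

# Point at a running text-generation-inference server, e.g. one started with
# the ghcr.io/huggingface/text-generation-inference Docker image.
tgi_llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # placeholder URL
    max_new_tokens=512,
    temperature=0.1,
)

# Wrap the LangChain LLM so LlamaIndex can use it. The model now lives in one
# server process; each worker only holds a lightweight HTTP client.
service_context = ServiceContext.from_defaults(
    llm=LangChainLLM(llm=tgi_llm),
    embed_model="local",  # local embedding model, so no OpenAI key is needed
)

documents = SimpleDirectoryReader("data").load_data()  # placeholder directory
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What does the report conclude?"))
```

The vLLM route is analogous: run the model behind vLLM's server and point an LLM client at it, instead of loading the weights in every worker process.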