The community member is using local models for a Q&A RAG pipeline and is trying to use multiprocessing to optimize resource usage. However, they are facing an issue where they have to load models for each process separately, which slows things down. They tried using a multiprocessing queue to share the service_context among processes, but encountered an error: "cannot pickle 'builtins.CoreBPE' object".
The comments suggest that the solution is to host the model once and send requests to it rather than loading it in every process. Community members recommend using a model server such as vLLM or Hugging Face's text-generation-inference, and suggest using LangChain's LLM wrapper to connect to the text-generation-inference server.
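A minimal sketch of that pattern, assuming a text-generation-inference server is already running locally and an older llama_index version that still uses ServiceContext; the server URL, generation parameters, and the "data" directory below are placeholders:

```python
from langchain.llms import HuggingFaceTextGenInference
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LangChainLLM

# One shared model server handles generation; each worker process only sends HTTP requests.
tgi_llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # placeholder: your TGI endpoint
    max_new_tokens=256,
    temperature=0.1,
)

# Wrap the LangChain LLM so it can be used inside a LlamaIndex service_context.
service_context = ServiceContext.from_defaults(
    llm=LangChainLLM(llm=tgi_llm),
    embed_model="local",  # keep embeddings local so no hosted embedding API is needed
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What does the document say about X?"))
```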
I am using local models for a Q&A RAG pipeline and am trying to use multiprocessing to make the best use of my resources. I am able to make it work, but I have to load the models separately for each process, which slows things down. I tried using a multiprocessing queue to share the service_context among processes but got this error: cannot pickle 'builtins.CoreBPE' object
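For context on the error itself: a multiprocessing queue pickles everything put on it, and the service_context holds a tiktoken tokenizer whose Rust-backed CoreBPE object has no pickle support. A minimal sketch of that limitation (exact behaviour may vary by tiktoken version):

```python
import pickle
import tiktoken

# The service_context bundles a tiktoken Encoding for token counting; the
# Rust-backed CoreBPE object inside it cannot be serialized with pickle,
# which is what a multiprocessing.Queue does under the hood.
encoding = tiktoken.get_encoding("cl100k_base")

try:
    pickle.dumps(encoding)
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle 'builtins.CoreBPE' object"
```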