The community member is using local models for a Q&A RAG pipeline and is trying to use multiprocessing to optimize resource usage. However, they are facing an issue where they have to load models for each process separately, which slows things down. They tried using a multiprocessing queue to share the service_context among processes, but encountered an error: "cannot pickle 'builtins.CoreBPE' object".
The comments suggest that the solution is to host the model once and send requests to it rather than loading it in every process. Community members recommend using a model server such as vLLM or Hugging Face's text-generation-inference, and suggest using LangChain's LLM wrapper to connect to the text-generation-inference server.
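A minimal sketch of that pattern, assuming a text-generation-inference server is already running locally and an older llama_index version that still uses ServiceContext; the server URL, generation parameters, and the "data" directory below are placeholders:

```python
from langchain.llms import HuggingFaceTextGenInference
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms import LangChainLLM

# One shared model server handles generation; each worker process only sends HTTP requests.
tgi_llm = HuggingFaceTextGenInference(
    inference_server_url="http://localhost:8080/",  # placeholder: your TGI endpoint
    max_new_tokens=256,
    temperature=0.1,
)

# Wrap the LangChain LLM so it can be used inside a LlamaIndex service_context.
service_context = ServiceContext.from_defaults(
    llm=LangChainLLM(llm=tgi_llm),
    embed_model="local",  # keep embeddings local so no hosted embedding API is needed
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
print(index.as_query_engine().query("What does the document say about X?"))
```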
I am using local models for a Q&A RAG pipeline and am trying to use multiprocessing to make the best use of my resources. I am able to make it work, but I have to load the models separately for each process, which slows things down. I tried using a multiprocessing queue to share the service_context among processes but got this error: cannot pickle 'builtins.CoreBPE' object
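For context on the error itself: a multiprocessing queue pickles everything put on it, and the service_context holds a tiktoken tokenizer whose Rust-backed CoreBPE object has no pickle support. A minimal sketch of that limitation (exact behaviour may vary by tiktoken version):

```python
import pickle
import tiktoken

# The service_context bundles a tiktoken Encoding for token counting; the
# Rust-backed CoreBPE object inside it cannot be serialized with pickle,
# which is what a multiprocessing.Queue does under the hood.
encoding = tiktoken.get_encoding("cl100k_base")

try:
    pickle.dumps(encoding)
except TypeError as exc:
    print(exc)  # e.g. "cannot pickle 'builtins.CoreBPE' object"
```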