
Local models

Hi, I am using LlamaIndex with a local open-source model, and the backend is built with FastAPI. Latency for concurrent users was very high, so I tried increasing the number of workers, but each worker then needs its own copy of the model loaded, and I do not have enough GPU memory for that. How can I handle this with LlamaIndex and FastAPI?

1 comment
Correct, each worker needs its own copy of the model in memory to actually serve multiple requests in parallel.

You could look into a more capable inference server for your model, one that can do proper batching and optimized inference.

Something like vLLM or TGI (Text Generation Inference).
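
A minimal sketch of that pattern (not from the thread): run a single vLLM process exposing its OpenAI-compatible API and have every FastAPI worker act as a thin client through LlamaIndex's `OpenAILike` LLM, so only one copy of the weights sits on the GPU while vLLM batches concurrent requests. The model name, port, and endpoint path below are assumptions you would adjust to your setup.

```python
# Sketch only: assumes vLLM is already serving an OpenAI-compatible API, e.g.
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
# and that llama-index-llms-openai-like is installed.
from fastapi import FastAPI
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike

app = FastAPI()

# Each FastAPI worker only creates this lightweight client; the model weights
# live once inside the vLLM process, which batches requests on the GPU.
Settings.llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the model vLLM loaded
    api_base="http://localhost:8000/v1",         # vLLM's OpenAI-compatible endpoint (assumed port)
    api_key="not-needed",                        # vLLM does not check the key by default
    is_chat_model=True,
)

@app.get("/generate")
async def generate(prompt: str):
    # Async call so a single worker can keep many requests in flight.
    response = await Settings.llm.acomplete(prompt)
    return {"text": response.text}
```

With this split, you can scale FastAPI workers for request handling without multiplying GPU memory use, since generation is centralized in the inference server.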