Hi all, do we have any LLM server framework, like vLLM or OpenLLM, that can run on CPU only and make use of multiple CPUs in a cluster (master and worker) to serve multiple inferences in parallel?
8 comments
Running on CPU will be terribly slow -- it's really not recommended.

I think that's why there's no general server support for this type of setup.

My best guess is running Ollama inside some auto-scaling Kubernetes cluster
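For what it's worth, here is a minimal sketch of sending several queries to an Ollama server at once from Python. It assumes Ollama is running locally on its default port (11434), a model named `llama3` has already been pulled, and the `requests` package is installed; whether the server actually decodes the requests in parallel depends on your Ollama version and its settings (newer versions expose an `OLLAMA_NUM_PARALLEL` environment variable for this).

```python
import concurrent.futures

import requests  # assumption: `pip install requests`

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # assumption: any model you have already pulled with `ollama pull`

def ask(prompt: str) -> str:
    # Non-streaming request: the whole answer comes back in a single JSON object.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Explain a mutex in one sentence.",
    "What is the capital of France?",
    "Briefly compare TCP and UDP.",
]

# Fire the queries concurrently; how many run truly in parallel is up to the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(f"Q: {prompt}\nA: {answer}\n")
```

Even when requests are accepted concurrently, the CPU still has to share its cores between them, so per-request latency goes up.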
Okay, I don't have Kubernetes. But I have a CPU with 48 cores and I'm still unable to serve multiple queries in parallel.
In order to serve in parallel, you need to make a copy of the model in memory
The model itself has to process things sequentially (some frameworks have batch-inferencing, which is similar)
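To make the "copy of the model per parallel stream" idea concrete, here is a rough sketch (not from the thread) that loads an independent copy of a small Hugging Face model in each worker process, so several queries can be answered at the same time at the cost of N times the memory. The model name, worker count, and thread count are placeholder assumptions.

```python
import multiprocessing as mp

import torch  # assumption: `pip install torch transformers`
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"   # placeholder: any small causal LM works for the demo
THREADS_PER_WORKER = 4      # assumption: roughly total cores / number of workers
_model = None
_tokenizer = None

def _init_worker():
    # Runs once per worker process: each one loads its OWN copy of the model.
    global _model, _tokenizer
    torch.set_num_threads(THREADS_PER_WORKER)  # keep workers from fighting over cores
    _tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    _model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate(prompt: str) -> str:
    inputs = _tokenizer(prompt, return_tensors="pt")
    output_ids = _model.generate(
        **inputs,
        max_new_tokens=40,
        pad_token_id=_tokenizer.eos_token_id,
    )
    return _tokenizer.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    prompts = ["The CPU is", "Parallel inference means", "A llama is"]
    # 3 workers -> 3 model copies in memory -> 3 queries handled at the same time.
    with mp.Pool(processes=3, initializer=_init_worker) as pool:
        for text in pool.map(generate, prompts):
            print(text, "\n---")
```

Each copy still runs its own sequential decode loop; capping the threads per worker just keeps the workers from stepping on each other's cores.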
Which frameworks support batch inferencing on CPU?
I actually don't know haha, but I know it's theoretically possible.
Again, running this on CPU is really not ideal, hence the general lack of support in the ecosystem
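The batch-inferencing question never got a concrete answer above, so as one hedged example: Hugging Face Transformers can run `generate()` over a padded batch of prompts on CPU, which amortises each decoding step across all requests in the batch (llama.cpp's server offers something similar through its parallel/continuous-batching options). A rough sketch with a placeholder model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumption: `pip install torch transformers`

MODEL_NAME = "distilgpt2"  # placeholder: swap in whatever model you actually serve

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token   # GPT-style models ship without a pad token
tokenizer.padding_side = "left"             # left-pad so every prompt ends where decoding begins
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompts = [
    "Batch inference on CPU works by",
    "A 48-core machine can",
    "The main downside of CPU-only serving is",
]

# One padded batch in, one generate() call out: each decoding step covers the whole batch.
batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**batch, max_new_tokens=40, pad_token_id=tokenizer.eos_token_id)

for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text, "\n---")
```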