Hi all, do we have any LLM server framework, like vLLM or OpenLLM, that can run on CPU only and make use of multiple CPUs in a cluster (master and worker) to serve multiple inferences in parallel?
8 comments
Running on CPU will be terribly slow -- it's really not recommended.

I think that's why there's no general server support for this type of setup.

My best guess is running Ollama inside some auto-scaling Kubernetes cluster
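For what it's worth, here is a minimal sketch of sending several queries to an Ollama server at once from Python. It assumes Ollama is running locally on its default port (11434), a model named `llama3` has already been pulled, and the `requests` package is installed; whether the server actually decodes the requests in parallel depends on your Ollama version and its settings (newer versions expose an `OLLAMA_NUM_PARALLEL` environment variable for this).

```python
import concurrent.futures

import requests  # assumption: `pip install requests`

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3"  # assumption: any model you have already pulled with `ollama pull`

def ask(prompt: str) -> str:
    # Non-streaming request: the whole answer comes back in a single JSON object.
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = [
    "Explain a mutex in one sentence.",
    "What is the capital of France?",
    "Briefly compare TCP and UDP.",
]

# Fire the queries concurrently; how many run truly in parallel is up to the server.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(ask, prompts)):
        print(f"Q: {prompt}\nA: {answer}\n")
```

Even when requests are accepted concurrently, the CPU still has to share its cores between them, so per-request latency goes up.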
Okay, I don't have Kubernetes. But I have a CPU with 48 cores and I'm still unable to serve multiple queries in parallel.
In order to serve in parallel, you need to make a copy of the model in memory
The model itself has to process things sequentially (some frameworks have batch-inferencing, which is similar)
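To make the "copy of the model per parallel stream" idea concrete, here is a rough sketch (not from the thread) that loads an independent copy of a small Hugging Face model in each worker process, so several queries can be answered at the same time at the cost of N times the memory. The model name, worker count, and thread count are placeholder assumptions.

```python
import multiprocessing as mp

import torch  # assumption: `pip install torch transformers`
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"   # placeholder: any small causal LM works for the demo
THREADS_PER_WORKER = 4      # assumption: roughly total cores / number of workers
_model = None
_tokenizer = None

def _init_worker():
    # Runs once per worker process: each one loads its OWN copy of the model.
    global _model, _tokenizer
    torch.set_num_threads(THREADS_PER_WORKER)  # keep workers from fighting over cores
    _tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    _model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def generate(prompt: str) -> str:
    inputs = _tokenizer(prompt, return_tensors="pt")
    output_ids = _model.generate(
        **inputs,
        max_new_tokens=40,
        pad_token_id=_tokenizer.eos_token_id,
    )
    return _tokenizer.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    prompts = ["The CPU is", "Parallel inference means", "A llama is"]
    # 3 workers -> 3 model copies in memory -> 3 queries handled at the same time.
    with mp.Pool(processes=3, initializer=_init_worker) as pool:
        for text in pool.map(generate, prompts):
            print(text, "\n---")
```

Each copy still runs its own sequential decode loop; capping the threads per worker just keeps the workers from stepping on each other's cores.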
Which frameworks support batch inferencing on CPU?
I actually don't know haha, but I know it's theoretically possible.
Again, running this on CPU is really not ideal, hence the general lack of support in the ecosystem
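The batch-inferencing question never got a concrete answer above, so as one hedged example: Hugging Face Transformers can run `generate()` over a padded batch of prompts on CPU, which amortises each decoding step across all requests in the batch (llama.cpp's server offers something similar through its parallel/continuous-batching options). A rough sketch with a placeholder model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumption: `pip install torch transformers`

MODEL_NAME = "distilgpt2"  # placeholder: swap in whatever model you actually serve

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token   # GPT-style models ship without a pad token
tokenizer.padding_side = "left"             # left-pad so every prompt ends where decoding begins
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompts = [
    "Batch inference on CPU works by",
    "A 48-core machine can",
    "The main downside of CPU-only serving is",
]

# One padded batch in, one generate() call out: each decoding step covers the whole batch.
batch = tokenizer(prompts, return_tensors="pt", padding=True)
output_ids = model.generate(**batch, max_new_tokens=40, pad_token_id=tokenizer.eos_token_id)

for text in tokenizer.batch_decode(output_ids, skip_special_tokens=True):
    print(text, "\n---")
```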