Hi All, do we have any llm server
Tech explorer
8 months ago
Hi All, do we have any LLM server framework like vLLM or OpenLLM that can run only on CPU and make use of multiple CPUs in a cluster (like master and worker) to serve multiple inferences in parallel?
8 comments
Logan M
8 months ago
Running on CPU will be terribly slow -- it's really not recommended.
I think that's why there's no general server support for this type of setup.
My best guess is running ollama inside some auto-scaling Kubernetes cluster.
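A minimal sketch of what the client side of that could look like, assuming an ollama instance is already serving on its default port 11434 (the model name and prompts here are placeholders; whether the requests actually run in parallel depends on how the server or its replicas are configured):

```python
import concurrent.futures
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes a local ollama server

def generate(prompt: str) -> str:
    # One non-streaming generation request against ollama's HTTP API.
    payload = json.dumps({"model": "llama3", "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

prompts = ["Summarize vLLM in one line.", "What is batch inference?"]

# Fire the requests concurrently from the client side.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)
```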
Tech explorer
8 months ago
Okay, I don't have Kubernetes. But I have a CPU with 48 cores and I'm still unable to serve multiple queries in parallel.
Logan M
8 months ago
In order to serve in parallel, you need to make a copy of the model in memory
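One rough sketch of that idea: run several worker processes, each loading its own copy of the model and pulling prompts from a shared queue. This example assumes llama-cpp-python with a local GGUF file (the path, thread count, and worker count are placeholders):

```python
import multiprocessing as mp
from llama_cpp import Llama  # any CPU runtime with per-process models works similarly

def worker(request_q: mp.Queue, result_q: mp.Queue) -> None:
    # Each worker process holds its own copy of the model in memory,
    # so four workers can each be busy with a different request at once.
    llm = Llama(model_path="model.gguf", n_threads=12)  # placeholder path; 48 cores / 4 workers
    while True:
        prompt = request_q.get()
        if prompt is None:  # sentinel: shut the worker down
            break
        out = llm(prompt, max_tokens=64)
        result_q.put(out["choices"][0]["text"])

if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(requests, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for p in ["hello", "what is vllm?"]:
        requests.put(p)
    for _ in range(2):
        print(results.get())
    for _ in workers:
        requests.put(None)
    for w in workers:
        w.join()
```

The tradeoff is RAM: every process keeps a full copy of the weights.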
Logan M
8 months ago
The model itself has to process things sequentially (some frameworks have batch-inferencing, which is similar)
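For contrast, a sketch of the batch-inferencing idea with Hugging Face transformers on CPU, where a single generate() call handles several prompts in one forward pass (the small GPT-2 model here is just a stand-in):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # runs on CPU by default

# Decoder-only models need left padding so every prompt ends at the same position.
tok.pad_token = tok.eos_token
tok.padding_side = "left"

prompts = ["The capital of France is", "A 48-core CPU can"]
batch = tok(prompts, return_tensors="pt", padding=True)

# One call processes the whole batch together instead of one prompt at a time.
outputs = model.generate(**batch, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.batch_decode(outputs, skip_special_tokens=True))
```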
Tech explorer
8 months ago
Which frameworks support batch inferencing on CPU?
Logan M
8 months ago
I actually don't know haha, but I know it's theoretically possible
Logan M
8 months ago
Again, running this on CPU is really not ideal, hence the general lack of support in the ecosystem
Tech explorer
8 months ago
Okay. Yup.