
Hi, I saw LLMLingua and tried it with LlamaIndex, also using llama.cpp to load the LLM. For each question, the time taken to get a prefix-match hit is too high. LLM inference time is reduced, but the time it takes to even reach the LLM is so high that my chat engine responds faster without LLMLingua. Any idea on this? I'm using only CPU.
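For reference, a minimal sketch of the setup being described, assuming a legacy (pre-0.10) llama-index install where LLMLingua is exposed as `LongLLMLinguaPostprocessor`; the model and data paths are hypothetical. Note that the compressor runs its own small language model, which on CPU can easily cost more time than the shorter prompt saves:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import LlamaCPP
from llama_index.postprocessor import LongLLMLinguaPostprocessor

# llama.cpp loads a local GGUF model on CPU.
llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path
    temperature=0.1,
    context_window=3900,
)

# LLMLingua compresses the retrieved context before it reaches the LLM. This
# step itself runs a small LM, so on CPU it can dominate end-to-end latency,
# matching the slowdown described above.
node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the question",
    target_token=300,
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
documents = SimpleDirectoryReader("./data").load_data()  # hypothetical data dir
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(node_postprocessors=[node_postprocessor])
print(query_engine.query("What does the document say about X?"))
```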
Using CPU is the limiting factor here -- llama.cpp is just really not that fast
Also, LlamaIndex chat_engine.stream_chat is behaving weirdly: if I ask the same question again, it gives a strange response and generates its own "user:" turn with some garbage question. How do I prevent that?
And what's the alternative to llama.cpp for CPU-only machines?
Llama.cpp is the best option for CPU, lol

Also, I'm not sure about that chat thing. It depends on which chat engine you are using, how you set up the LLM, etc.
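A hedged sketch of one setup that often stops the invented "user:" turns, assuming they come from the model free-running past its answer on a raw transcript. It formats chat turns with the Llama-2 prompt helpers and passes explicit stop strings through to llama.cpp; the paths and stop list are illustrative assumptions, not a confirmed fix from this thread:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local path
    temperature=0.1,
    context_window=3900,
    # Wrap turns in the [INST] ... [/INST] format Llama-2 chat models expect,
    # so the model sees a chat, not a raw transcript to keep extending.
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    # Assumed pass-through to llama-cpp-python: hard-stop generation if the
    # model still starts writing the next user turn itself.
    generate_kwargs={"stop": ["User:", "user:"]},
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")
index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./data").load_data(),  # hypothetical data dir
    service_context=service_context,
)

# "condense_question" rewrites each follow-up into a standalone query, so the
# LLM never sees (or tries to continue) the raw chat history.
chat_engine = index.as_chat_engine(chat_mode="condense_question")
streaming_response = chat_engine.stream_chat("Same question, asked again")
for token in streaming_response.response_gen:
    print(token, end="")
```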