
Updated 4 months ago

How to make llamacpp model inference faster?

At a glance

The community member is using LlamaIndex RAG with a local GGUF model, and a single query takes more than 2 minutes. They don't have a GPU and are running on a CPU with 12 cores and 48 GB of RAM, but the process is still slow. Another community member notes that unless llama.cpp was compiled for GPU, this is about as fast as it will get. The original poster also tried TinyLlama, which is faster (about 40 seconds), but its output is not good enough for their RAG use case. They are looking for the best small language model to use with RAG and for ways to optimize the pipeline, such as limiting LLM calls, tuning retrieval and prompting, and configuring node size and retrievers. There is no explicitly marked answer in the comments.

How to make llamacpp model inference faster? I am using LlamaIndex RAG with a local GGUF model and it's taking more than 2 minutes for a single query.
12 comments
Unless llamacpp was compiled for GPU, that's about as fast as it's going to get.
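[For reference, not from the thread: a minimal sketch of how to check which build you have, assuming a reasonably recent llama-cpp-python install; older releases may not expose this helper.]

```python
# Minimal sketch: check whether the installed llama-cpp-python build
# can offload layers to a GPU. If this prints False, n_gpu_layers is
# ignored and inference stays entirely on the CPU.
import llama_cpp

print(llama_cpp.llama_supports_gpu_offload())
```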
I don't have a GPU. I am running on a CPU with 12 cores and 48 GB of RAM. It still takes more than 2 minutes for a 7B model.
That sounds about right. I have a 16-core / 48 GB machine and I have had a similar experience on CPU πŸ™‚
TinyLlama answers in about 40 seconds, but the output is not as expected for my RAG use case.
Is there a good small language model to use with RAG?
You should limit the LLM calls and tune your retrieval and prompting (see the sketch below).
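[For context, not from the thread: one way to keep LLM calls per query low in LlamaIndex is a compact response mode with a small top-k, so all retrieved chunks are packed into as few prompts as possible. A rough sketch assuming a recent llama_index.core API and that `index` is a VectorStoreIndex you have already built; the top-k value and query text are illustrative.]

```python
# Minimal sketch: one retrieval + (ideally) one LLM call per query.
# "compact" packs as many retrieved chunks as fit into each prompt,
# instead of issuing one LLM call per chunk.
query_engine = index.as_query_engine(
    similarity_top_k=2,        # fewer retrieved chunks -> shorter prompt
    response_mode="compact",   # minimize the number of LLM calls
)
response = query_engine.query("What does the document say about X?")
print(response)
```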
I am running TinyLlama in llama.cpp (not llama-cpp-python, which is slow for me) and I get about 10 t/s on a built-in mobile GPU with 4 GB of VRAM, and about 5-6 t/s on CPU using around 4 GB of RAM.
How can that be done? Could you please point me to some examples?
I am using LlamaIndex's LlamaCPP to load the LLM.
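[For context, not from the thread: loading a local GGUF through LlamaIndex's LlamaCPP wrapper typically looks roughly like this. The model path is a placeholder, the thread values (12 cores) are used for n_threads, and the import path assumes a newer (0.10+) llama-index install with the llama-index-llms-llama-cpp package; older versions import LlamaCPP from llama_index.llms.]

```python
from llama_index.core import Settings
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/model.gguf",  # local GGUF file (placeholder path)
    temperature=0.1,
    max_new_tokens=256,     # long generations dominate CPU latency
    context_window=2048,    # smaller context -> faster prompt evaluation
    model_kwargs={
        "n_threads": 12,    # match your physical core count
        "n_batch": 256,     # prompt-evaluation batch size
    },
    verbose=False,
)

Settings.llm = llm  # use this LLM for all LlamaIndex query engines
```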
Look into retrievers also.
And node size when indexing (see the sketch below).
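[For context, not from the thread: smaller nodes at indexing time plus a low retriever top-k shrink the prompt the CPU-bound LLM has to process. A rough sketch with a recent llama_index.core API; the directory name, chunk sizes, top-k and query are illustrative.]

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load documents and split them into smaller nodes at indexing time.
documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

# Build the vector index over the nodes.
# Note: this needs an embedding model configured (Settings.embed_model);
# the default is OpenAI unless you set a local one.
index = VectorStoreIndex(nodes)

# A retriever with a small top-k keeps the final prompt short.
retriever = index.as_retriever(similarity_top_k=2)
for node_with_score in retriever.retrieve("What does the document say about X?"):
    print(node_with_score.score, node_with_score.node.get_content()[:80])
```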