
Updated 4 months ago

How to make llamacpp model inference faster?

At a glance

The community member is using LlamaIndex RAG with a local GGUF model, and a single query takes more than 2 minutes. They don't have a GPU and are running on a CPU with 12 cores and 48 GB of RAM, but the process is still slow. Another community member notes that unless llama.cpp was compiled for GPU, this is about as fast as it will get. The original poster also tried TinyLlama, which is faster (about 40 seconds), but its output is not good enough for their RAG use case. They are looking for the best small language model to use with RAG and for ways to optimize the pipeline, such as limiting LLM calls, tuning retrieval and prompting, and configuring node size and retrievers. There is no explicitly marked answer in the comments.

How to make llamacpp model inference faster? I am using LlamaIndex RAG with a local GGUF model and it's taking more than 2 minutes for a single query.
12 comments
Unless llamacpp was compiled for GPU, that's about as fast as it's going to get.
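[For reference, not from the thread: a minimal sketch of how to check which build you have, assuming a reasonably recent llama-cpp-python install; older releases may not expose this helper.]

```python
# Minimal sketch: check whether the installed llama-cpp-python build
# can offload layers to a GPU. If this prints False, n_gpu_layers is
# ignored and inference stays entirely on the CPU.
import llama_cpp

print(llama_cpp.llama_supports_gpu_offload())
```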
I don't have a GPU. I am running on a CPU with 12 cores and 48 GB of RAM. It still takes more than 2 minutes for a 7B model.
That sounds about right. I have a 16-core / 48 GB machine and I have had a similar experience on CPU πŸ™‚
TinyLlama answers in about 40 seconds, but the output is not as expected for my RAG use case.
Is there a good small language model to use with RAG?
You should limit the LLM calls and tune your retrieval and prompting (see the sketch below).
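[For context, not from the thread: one way to keep LLM calls per query low in LlamaIndex is a compact response mode with a small top-k, so all retrieved chunks are packed into as few prompts as possible. A rough sketch assuming a recent llama_index.core API and that `index` is a VectorStoreIndex you have already built; the top-k value and query text are illustrative.]

```python
# Minimal sketch: one retrieval + (ideally) one LLM call per query.
# "compact" packs as many retrieved chunks as fit into each prompt,
# instead of issuing one LLM call per chunk.
query_engine = index.as_query_engine(
    similarity_top_k=2,        # fewer retrieved chunks -> shorter prompt
    response_mode="compact",   # minimize the number of LLM calls
)
response = query_engine.query("What does the document say about X?")
print(response)
```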
I am running TinyLlama in llama.cpp (not llama-cpp-python, which is slow for me) and I get about 10 t/s on a built-in mobile GPU with 4 GB of VRAM, and about 5-6 t/s on CPU using around 4 GB of RAM.
How can that be done? Could you please point me to some examples?
I am using LlamaIndex's LlamaCPP to load the LLM.
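[For context, not from the thread: loading a local GGUF through LlamaIndex's LlamaCPP wrapper typically looks roughly like this. The model path is a placeholder, the thread values (12 cores) are used for n_threads, and the import path assumes a newer (0.10+) llama-index install with the llama-index-llms-llama-cpp package; older versions import LlamaCPP from llama_index.llms.]

```python
from llama_index.core import Settings
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/model.gguf",  # local GGUF file (placeholder path)
    temperature=0.1,
    max_new_tokens=256,     # long generations dominate CPU latency
    context_window=2048,    # smaller context -> faster prompt evaluation
    model_kwargs={
        "n_threads": 12,    # match your physical core count
        "n_batch": 256,     # prompt-evaluation batch size
    },
    verbose=False,
)

Settings.llm = llm  # use this LLM for all LlamaIndex query engines
```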
Look into retrievers also.
And node size when indexing (see the sketch below).
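[For context, not from the thread: smaller nodes at indexing time plus a low retriever top-k shrink the prompt the CPU-bound LLM has to process. A rough sketch with a recent llama_index.core API; the directory name, chunk sizes, top-k and query are illustrative.]

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load documents and split them into smaller nodes at indexing time.
documents = SimpleDirectoryReader("data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = splitter.get_nodes_from_documents(documents)

# Build the vector index over the nodes.
# Note: this needs an embedding model configured (Settings.embed_model);
# the default is OpenAI unless you set a local one.
index = VectorStoreIndex(nodes)

# A retriever with a small top-k keeps the final prompt short.
retriever = index.as_retriever(similarity_top_k=2)
for node_with_score in retriever.retrieve("What does the document say about X?"):
    print(node_with_score.score, node_with_score.node.get_content()[:80])
```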