
How to make llama.cpp model inference faster?

How to make llama.cpp model inference faster? I'm using LlamaIndex RAG with a local GGUF model and it's taking more than 2 minutes for a single query.
Unless llama.cpp was compiled for GPU, that's as fast as it's getting.
I don't have a GPU. I'm running on CPU with 12 cores and 48 GB RAM. It still takes more than 2 minutes for a 7B model.
That sounds about right. I have 16 cores and 48 GB and I've had a similar experience on CPU πŸ™‚
TinyLlama answers in about 40 seconds, but the output is not good enough for my RAG use case.
Is there a good small language model to use with RAG?
You should limit the LLM calls and tune your retrieval and prompting.
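A minimal sketch of what "limit the LLM calls" can look like in LlamaIndex, assuming an existing VectorStoreIndex named `index` (hypothetical name): a small `similarity_top_k` keeps the prompt short, and `response_mode="compact"` packs the retrieved chunks into as few LLM calls as possible.

```python
# Assumes `index` is an already-built VectorStoreIndex (hypothetical name).
# Fewer retrieved nodes + "compact" response mode = fewer and shorter LLM calls.
query_engine = index.as_query_engine(
    similarity_top_k=2,        # retrieve only the 2 most similar nodes
    response_mode="compact",   # stuff chunks into as few LLM calls as possible
)

response = query_engine.query("Your question here")
print(response)
```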
I'm running TinyLlama in llama.cpp (not llama-cpp-python, it's slow) and I get about 10 t/s on a built-in mobile GPU with 4 GB VRAM, and about 5-6 t/s on CPU using around 4 GB RAM.
How can that be done? Could you please point me to examples?
I am using LlamaIndex's LlamaCPP wrapper to load the LLM.
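For reference, a hedged sketch of loading a GGUF model through LlamaIndex's LlamaCPP wrapper with CPU-oriented settings. The model path is a placeholder, the import path is for recent LlamaIndex releases (older versions use `from llama_index.llms import LlamaCPP`), and everything in `model_kwargs` is passed straight to llama-cpp-python.

```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/model-7b.Q4_K_M.gguf",  # placeholder path to your GGUF file
    temperature=0.1,
    max_new_tokens=256,     # capping output length shortens generation time
    context_window=2048,    # smaller context = less prompt processing on CPU
    model_kwargs={
        "n_threads": 12,    # match your physical core count
        "n_batch": 512,     # prompt-processing batch size
    },
    verbose=False,
)
```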
Look into retrievers also.
And node size when indexing.
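As a rough illustration of tuning node size at indexing time, assuming current `llama_index.core` import paths (older versions configure this through a ServiceContext instead) and a placeholder `./data` folder:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Smaller nodes mean shorter prompts per LLM call; tune chunk_size for your data.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
```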