How to make llamacpp model inference faster?
Tech explorer
10 months ago
How to make llama.cpp model inference faster? I'm using a LlamaIndex RAG setup with a local GGUF model and it's taking more than 2 minutes for a single query.
12 comments
Logan M
10 months ago
Unless llama.cpp was compiled for GPU, that's as fast as it's getting.
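A minimal sketch of what GPU offload looks like with llama-cpp-python under the hood (the CMake flag name varies by llama-cpp-python version, the import path assumes the post-0.10 LlamaIndex package layout, and the model path is a placeholder):

```python
# Rebuild llama-cpp-python with CUDA support first, e.g.
#   CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
# (older releases used -DLLAMA_CUBLAS=on instead)
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/your-7b-model.Q4_K_M.gguf",  # placeholder path
    context_window=3900,
    max_new_tokens=256,
    # n_gpu_layers=-1 asks llama.cpp to offload every layer it can to the GPU
    model_kwargs={"n_gpu_layers": -1},
)
```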
Tech explorer
10 months ago
I don't have a GPU. I'm running on a 12-core CPU with 48 GB RAM, and it still takes more than 2 minutes for a 7B model.
Logan M
10 months ago
That sounds about right. I have a 16-core machine with 48 GB and I've had a similar experience on CPU.
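On CPU the main knobs are the thread count, the batch size, and how aggressively the model is quantized. A sketch, assuming llama-cpp-python underneath so that n_threads and n_batch can be passed through LlamaCPP's model_kwargs (the model path is a placeholder):

```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/your-7b-model.Q4_K_M.gguf",  # 4-bit quant; placeholder path
    context_window=3900,
    max_new_tokens=256,   # shorter answers = less decode time
    model_kwargs={
        "n_threads": 12,  # match the number of physical cores
        "n_batch": 512,   # prompt-processing batch size
    },
)
```

A smaller quantization (Q4 instead of Q8/FP16) and a lower max_new_tokens both cut per-query time noticeably.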
Tech explorer
10 months ago
TinyLlama answers in about 40 seconds, but the output is not as expected for my RAG use case.
Tech explorer
10 months ago
Is there a good small language model to use with RAG?
hansson0728
10 months ago
You should limit the LLM calls and put the work into retrieval and prompting.
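In LlamaIndex that mostly means keeping similarity_top_k small and using a compact response mode, so the retrieved chunks are packed into as few LLM calls as possible. A sketch, assuming an existing VectorStoreIndex called index:

```python
query_engine = index.as_query_engine(
    similarity_top_k=2,       # retrieve fewer chunks -> less context per query
    response_mode="compact",  # stuff chunks into as few LLM calls as possible
)
response = query_engine.query("your question here")
```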
hansson0728
10 months ago
I'm running TinyLlama in llama.cpp directly (not llama-cpp-python, which is slow) and I get about 10 t/s on a built-in mobile GPU with 4 GB VRAM, and about 5-6 t/s on CPU using around 4 GB of RAM.
Tech explorer
10 months ago
How can that be done? Could you please point me to some examples?
Tech explorer
10 months ago
I am using LlamaIndex's LlamaCPP to load the LLM.
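For reference, a typical LlamaIndex + LlamaCPP setup looks roughly like this; a sketch assuming the post-0.10 package layout, with placeholder model and data paths and a small local embedding model so indexing never touches the LLM:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.llm = LlamaCPP(
    model_path="./models/your-7b-model.Q4_K_M.gguf",  # placeholder path
    context_window=3900,
    max_new_tokens=256,
)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("your question here"))
```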
Tech explorer
10 months ago
https://docs.llamaindex.ai/en/latest/understanding/querying/querying.html
Configuring node postprocessors?
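Node postprocessors are passed to the query engine; a sketch using the built-in SimilarityPostprocessor to drop weakly matching chunks before they reach the LLM (the cutoff value is just an example, and index is assumed to already exist):

```python
from llama_index.core.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=4,
    node_postprocessors=[
        # drop retrieved nodes whose similarity score falls below the cutoff
        SimilarityPostprocessor(similarity_cutoff=0.75),
    ],
)
```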
hansson0728
10 months ago
Look into retrievers also.
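A retriever can also be built explicitly and wrapped in a query engine, which makes the retrieval step easy to tune on its own; a sketch assuming an existing index:

```python
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

retriever = VectorIndexRetriever(index=index, similarity_top_k=2)
query_engine = RetrieverQueryEngine.from_args(retriever=retriever)
response = query_engine.query("your question here")
```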
hansson0728
10 months ago
And node size when indexing.
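Node (chunk) size is set at indexing time through the node parser; a sketch using SentenceSplitter with example sizes (smaller chunks mean less context per LLM call, at some cost in answer quality):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)  # example sizes
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
```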