Hello. I am using LLI for a chatbot / query engine with a local 7B-parameter Llama.cpp model. Each generation takes about 1.5 minutes. Is that normal? I have 32 GB of system RAM and 4 GB of VRAM. How does one estimate how many resources are needed, and what factors improve inference speed? The data it is indexing is just three text files. I remember things being faster earlier.
2 comments
These are the numbers:
llama_print_timings:        load time =   16589.32 ms
llama_print_timings:      sample time =      47.42 ms /   199 runs   (    0.24 ms per token,  4196.28 tokens per second)
llama_print_timings: prompt eval time =   81969.71 ms /  1843 tokens (   44.48 ms per token,    22.48 tokens per second)
llama_print_timings:        eval time =   26429.32 ms /   198 runs   (  133.48 ms per token,     7.49 tokens per second)
llama_print_timings:       total time =  108885.07 ms /  2041 tokens
That sounds about right to me for CPU-only runtime. Did you load the model onto the GPU?
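
Reading the timings above: prompt eval took roughly 82 s for 1,843 tokens and generation roughly 26 s for 198 tokens, which together account for the ~109 s total, so most of the 1.5 minutes is spent processing the prompt on the CPU. Offloading layers to the GPU usually helps the most here. Below is a minimal sketch using llama-cpp-python, assuming a GPU-enabled build (cuBLAS/Metal); the model path is hypothetical and the layer count is only a guess aimed at fitting inside 4 GB of VRAM, since a 4-bit 7B GGUF is itself around 4 GB.

# Minimal sketch, not from the thread; path and layer count are hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=20,   # 0 = pure CPU; raise until your 4 GB of VRAM is nearly full
    n_ctx=2048,        # context window; long prompts are what dominate the 82 s above
    n_threads=8,       # set to your physical core count for the CPU-resident layers
)

out = llm("Q: Why is prompt eval slow on CPU?\nA:", max_tokens=128)
print(out["choices"][0]["text"])

If you are going through LlamaIndex's LlamaCPP wrapper instead of llama-cpp-python directly, I believe the same n_gpu_layers value can be passed through its model_kwargs.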

FYI, I would try using Ollama. It makes the setup much easier and handles all the model loading and optimization for you.
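
If you go the Ollama route, here is a rough sketch of pointing LlamaIndex at it. The model name is just an example, and the import path varies by LlamaIndex version (older releases use from llama_index.llms import Ollama).

# Assumes the Ollama server is installed and a model has already been pulled, e.g.:
#   ollama pull llama2
from llama_index.llms.ollama import Ollama  # older versions: from llama_index.llms import Ollama

llm = Ollama(model="llama2", request_timeout=120.0)  # generous timeout for a 7B model on CPU
print(llm.complete("Summarize the three indexed text files in one sentence."))

Ollama will use a supported GPU automatically when one is available, which is a big part of why it feels faster out of the box.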