performance degradation between 0.2.18 a...

Hi, I noticed a big slowdown in LlamaIndex depending on the version of llama-cpp-python I have installed (older versions are much faster: about 100 tokens per second versus about 30 tokens per second). llama-cpp-python==0.2.20 is the last fast version with the latest LlamaIndex (3090 on Ubuntu, using Mistral 7B). I believe it has to do with the KV cache and is solved by the suggestions in this GitHub issue: https://github.com/abetlen/llama-cpp-python/issues/1054. How do we add the required offload_kqv=True to LlamaIndex to regain fast inference? Is this a regression in LlamaIndex or something users should handle?
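For context, at the llama-cpp-python level the workaround from that issue looks roughly like this (a sketch; the model path is a placeholder, and it assumes a llama-cpp-python version that exposes offload_kqv):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all transformer layers to the GPU
    offload_kqv=True,  # keep the KV cache on the GPU as well
)
```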
7 comments
This is something users should handle, I think? Although we could add that as the default. I don't really keep up with llama.cpp.

You can set it like this:
llm = LlamaCPP(..., model_kwargs={"offload_kqv": True})
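A fuller, self-contained sketch might look like the following (the import path and model path are assumptions and can differ between LlamaIndex versions; model_kwargs is forwarded to the underlying llama_cpp.Llama constructor):

```python
from llama_index.llms import LlamaCPP  # newer releases: from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    model_kwargs={
        "n_gpu_layers": -1,   # offload all transformer layers to the GPU
        "offload_kqv": True,  # keep the KV cache on the GPU too
    },
)
print(llm.complete("Hello"))
```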
Adding the flag to model_kwargs works for me. But I would suggest making it the default value, as the change in llama-cpp-python seems to affect all GPUs, at least for the Mistral model; see also: https://github.com/abetlen/llama-cpp-python/issues/999
I'm hesitant to make the default different from the default in llama.cpp (there must be a reason they did that, right?)

Is there any discussion on what this option actually does?
This might be the key PR in llama.cpp. I'm not sure why it was done, though (i.e., what backward compatibility they're trying to retain): https://github.com/ggerganov/llama.cpp/pull/4309
The CUDA performance with quantum cache is a bit disappointing. Hopefully we will fix this in the future lol
Lots of people are reporting speedups though, weird
Actually, there are reports on both sides