performance degradation between 0.2.18 a...

Hi, I noticed a big slowdown in LlamaIndex depending on the version of llama-cpp-python I have installed (older versions are much faster: about 100 tokens per second versus about 30 tokens per second). llama-cpp-python==0.2.20 is the last fast version with the latest LlamaIndex (3090 on Ubuntu, using Mistral 7B). I believe it has to do with the KV cache and is solved by the suggestions in this GitHub issue: https://github.com/abetlen/llama-cpp-python/issues/1054. How do we add the required offload_kqv=True to LlamaIndex to regain fast inference? Is this a regression in LlamaIndex or something users should handle?
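For context, at the llama-cpp-python level the workaround from that issue looks roughly like this (a sketch; the model path is a placeholder, and it assumes a llama-cpp-python version that exposes offload_kqv):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all transformer layers to the GPU
    offload_kqv=True,  # keep the KV cache on the GPU as well
)
```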
7 comments
This is something users should handle, I think? Although we could add that as the default. I don't really keep up with llama.cpp.

You can set it like this:
llm = LlamaCPP(..., model_kwargs={"offload_kqv": True})
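A fuller, self-contained sketch might look like the following (the import path and model path are assumptions and can differ between LlamaIndex versions; model_kwargs is forwarded to the underlying llama_cpp.Llama constructor):

```python
from llama_index.llms import LlamaCPP  # newer releases: from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="/path/to/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    model_kwargs={
        "n_gpu_layers": -1,   # offload all transformer layers to the GPU
        "offload_kqv": True,  # keep the KV cache on the GPU too
    },
)
print(llm.complete("Hello"))
```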
Adding the flag to model_kwargs works for me. But I would suggest making it the default value, as the change in llama-cpp-python seems to affect all GPUs, at least for the Mistral model; see also: https://github.com/abetlen/llama-cpp-python/issues/999
I'm hesitant to make the default different from the default in llama.cpp (there must be a reason they did that, right?)

Is there any discussion on what this option actually does?
This might be the key PR in llama.cpp. I'm not sure why it was done, though (i.e., what backward compatibility they're trying to retain): https://github.com/ggerganov/llama.cpp/pull/4309
The CUDA performance with quantum cache is a bit disappointing. Hopefully we will fix this in the future lol
Lots of people are reporting speedups though, weird
Actually, there are reports on both sides