Hi, I've noticed a big slowdown in LlamaIndex depending on which version of llama-cpp-python I have installed: older versions are much faster, about 100 tokens per second versus about 30 tokens per second.
llama-cpp-python==0.2.20 is the last fast version with the latest LlamaIndex (RTX 3090 on Ubuntu, running Mistral 7B). I believe the slowdown is related to the KV cache, and it is fixed by the suggestions in this GitHub issue: https://github.com/abetlen/llama-cpp-python/issues/1054
How do we pass the required offload_kqv=True through LlamaIndex to regain fast inference? Is this a regression in LlamaIndex, or something users are expected to handle themselves?
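
For reference, this is roughly what I would expect to work, assuming the LlamaCPP wrapper forwards model_kwargs straight to the llama_cpp.Llama constructor (the import path and the model path below are just placeholders for my setup, and I'm not sure this is the supported way):

```python
# Minimal sketch: pass offload_kqv via model_kwargs to LlamaIndex's LlamaCPP wrapper.
# Assumes model_kwargs is forwarded to llama_cpp.Llama; import path may differ by version.
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF path
    temperature=0.1,
    max_new_tokens=256,
    context_window=4096,
    model_kwargs={
        "n_gpu_layers": -1,    # offload all layers to the 3090
        "offload_kqv": True,   # keep the KV cache on the GPU (see the linked issue)
    },
)

response = llm.complete("Hello, world!")
print(response.text)
```

If model_kwargs does get forwarded like this, the question is whether LlamaIndex should set offload_kqv=True by default for GPU setups, or at least document it.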