The community member is using a LlamaIndex RAG pipeline with a local GGUF model, and a single query takes more than 2 minutes. They don't have a GPU and are running on a CPU with 12 cores and 48 GB of RAM, but the process is still slow. Another community member suggests that unless llama.cpp was compiled for GPU, this is as fast as it's going to get. The community member also tried TinyLlama, which is faster (about 40 seconds), but its output is not what they expect for their RAG use case. They are looking for the best small language model to use with RAG and for ways to optimize the process, such as limiting LLM calls, tuning retrieval and prompting, and configuring node post-processors and retrievers. There is no explicitly marked answer in the comments.
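A minimal sketch of the kind of setup being discussed, assuming the `llama-index-llms-llama-cpp` and `llama-index-embeddings-huggingface` integration packages are installed; the model path, embedding model, and `./data` directory are placeholders, not from the thread. The idea is to keep CPU work down by limiting LLM calls (`response_mode="compact"`, low `similarity_top_k`) and capping generation length:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.llama_cpp import LlamaCPP

# Local GGUF model run through llama.cpp; n_threads matches the 12-core CPU.
Settings.llm = LlamaCPP(
    model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # placeholder path
    temperature=0.1,
    max_new_tokens=256,          # shorter answers mean less decode time on CPU
    context_window=2048,
    model_kwargs={"n_threads": 12},
    verbose=False,
)
# A small local embedding model keeps retrieval off the LLM entirely.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# "compact" packs retrieved chunks into as few LLM calls as possible,
# and a low similarity_top_k keeps the prompt short.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)
print(query_engine.query("What does the document say about X?"))
```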
I am running TinyLlama in llama.cpp (not llama-cpp-python, which is slow) and I get about 10 t/s on a built-in mobile GPU with 4 GB of VRAM, and about 5-6 t/s on the CPU using around 4 GB of RAM.
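The comment favors the llama.cpp binary directly, but if you stay inside LlamaIndex the equivalent knob is partial GPU offload via `n_gpu_layers`. A minimal sketch, assuming llama-cpp-python was built with GPU support; the layer count and model path are placeholder values to tune for roughly 4 GB of VRAM:

```python
from llama_index.llms.llama_cpp import LlamaCPP

llm = LlamaCPP(
    model_path="./models/tinyllama-1.1b-chat.Q4_K_M.gguf",  # placeholder path
    max_new_tokens=256,
    context_window=2048,
    model_kwargs={
        "n_gpu_layers": 20,  # offload as many layers as fit in VRAM; 0 = CPU-only
        "n_threads": 12,     # CPU threads for whatever stays on the CPU
    },
)
print(llm.complete("Summarize retrieval-augmented generation in one sentence."))
```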