
RAG Time

Hi, is there any way to speed up the response generation of a simple query engine?

My query engine looks like this. My QA_PROMPT is quite long; I'm not sure if that's slowing down the response generation. But what other factors can speed things up?
Attachment: image.png
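Since the attachment is an image, here is a minimal sketch of what such a setup might look like. This is not the poster's actual code: the prompt text, data path, and query are illustrative, using the legacy `llama_index` imports from around the time of this thread:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.prompts import PromptTemplate

# Illustrative stand-in for the long custom QA prompt
QA_PROMPT = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# text_qa_template swaps the custom prompt into answer synthesis
query_engine = index.as_query_engine(text_qa_template=QA_PROMPT)
response = query_engine.query("What does the document say about X?")
print(response)
```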
6 comments
The major portion of the time goes to response generation. It depends on the type of LLM you are using.

LlamaIndex recently posted an article on how you can improve your RAG system: https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5
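That article is about tuning chunk size. As a rough sketch of the knob it discusses (the 512 value is just an example, using the legacy `ServiceContext` API):

```python
from llama_index import ServiceContext, VectorStoreIndex

# Smaller chunks mean less context per LLM call; 512 is illustrative
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```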
Thanks. Do you know the default LLM a query engine uses?
If you have set an OpenAI key, it will use GPT-3.5; otherwise it'll use Llama 2 locally.
Got it, thanks!
You could use the retriever on its own and measure the time (index.as_retriever()) to check whether the problem is in the LLM generation or the retrieval.
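For example, a quick timing check might look like this (a sketch, assuming the index and query engine from above):

```python
import time

# Time retrieval alone (the top_k value is illustrative)
retriever = index.as_retriever(similarity_top_k=2)

start = time.perf_counter()
nodes = retriever.retrieve("What does the document say about X?")
print(f"Retrieval: {time.perf_counter() - start:.2f}s ({len(nodes)} nodes)")

# Time the full pipeline: retrieval + LLM response generation
start = time.perf_counter()
response = query_engine.query("What does the document say about X?")
print(f"Full query: {time.perf_counter() - start:.2f}s")
```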
My initial hypothesis is that the generation is taking much longer than the retrieval.
If you let your model generate more than 256 tokens, it can take a while as well. I would limit the output to 256 tokens.
You can also add to your prompt: "Do not answer with more than XYZ words."
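A sketch of capping the output length, assuming the GPT-3.5 default mentioned above and the legacy `ServiceContext` API:

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# Cap generation at 256 output tokens, per the suggestion above
llm = OpenAI(model="gpt-3.5-turbo", max_tokens=256)
service_context = ServiceContext.from_defaults(llm=llm)

query_engine = index.as_query_engine(
    service_context=service_context,
    text_qa_template=QA_PROMPT,
)
```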