
RAG Time

Hi, is there any way to speed up the response generation of a simple query engine?

My query engine looks like this. My QA_PROMPT is quite long; I'm not sure if that's slowing down the response generation. But what other factors can speed things up?
Attachment: image.png
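Since the attachment is an image, here is a minimal sketch of what such a setup might look like. This is not the poster's actual code: the prompt text, data path, and query are illustrative, using the legacy `llama_index` imports from around the time of this thread:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.prompts import PromptTemplate

# Illustrative stand-in for the long custom QA prompt
QA_PROMPT = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# text_qa_template swaps the custom prompt into answer synthesis
query_engine = index.as_query_engine(text_qa_template=QA_PROMPT)
response = query_engine.query("What does the document say about X?")
print(response)
```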
6 comments
The major portion of the time goes to response generation. It depends on the type of LLM you are using.

LlamaIndex recently posted an article on how you can improve your RAG system: https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5
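That article is about tuning chunk size. As a rough sketch of the knob it discusses (the 512 value is just an example, using the legacy `ServiceContext` API):

```python
from llama_index import ServiceContext, VectorStoreIndex

# Smaller chunks mean less context per LLM call; 512 is illustrative
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```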
Thanks. Do you know the default LLM a query engine uses?
If you have set an OpenAI key, it will use GPT-3.5; otherwise it'll use Llama 2 locally.
Got it, thanks!
You could use the retriever on its own and measure the time (index.as_retriever()) to check whether the problem is in the LLM generation or the retrieval.
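For example, a quick timing check might look like this (a sketch, assuming the index and query engine from above):

```python
import time

# Time retrieval alone (the top_k value is illustrative)
retriever = index.as_retriever(similarity_top_k=2)

start = time.perf_counter()
nodes = retriever.retrieve("What does the document say about X?")
print(f"Retrieval: {time.perf_counter() - start:.2f}s ({len(nodes)} nodes)")

# Time the full pipeline: retrieval + LLM response generation
start = time.perf_counter()
response = query_engine.query("What does the document say about X?")
print(f"Full query: {time.perf_counter() - start:.2f}s")
```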
My initial hypothesis is that the generation is taking much longer than the retrieval.
If you let your model generate more than 256 tokens, it can take a while as well. I would limit the output to 256 tokens.
You can also add to your prompt: "Do not answer with more than XYZ words."
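A sketch of capping the output length, assuming the GPT-3.5 default mentioned above and the legacy `ServiceContext` API:

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI

# Cap generation at 256 output tokens, per the suggestion above
llm = OpenAI(model="gpt-3.5-turbo", max_tokens=256)
service_context = ServiceContext.from_defaults(llm=llm)

query_engine = index.as_query_engine(
    service_context=service_context,
    text_qa_template=QA_PROMPT,
)
```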