```
Querying with: Come posso arrivare alla fiera?
Oct 11 08:32:40 messe-rag-chatbot app/web.1 **********
Oct 11 08:32:40 messe-rag-chatbot app/web.1 Trace: chat
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.TEMPLATING ->  4.4e-05 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.LLM ->  2.075113 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.QUERY ->  2.032007 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1       |_CBEventType.RETRIEVE ->  2.027069 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1       |_CBEventType.SYNTHESIZE ->  0.004779 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1         |_CBEventType.TEMPLATING ->  3.3e-05 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1         |_CBEventType.LLM ->  0.0 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.LLM ->  0.0 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1 **********
```
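For reference, a trace like the one above is what LlamaDebugHandler prints at the end of each query when print_trace_on_end is enabled. A minimal sketch of wiring it up, assuming the legacy ServiceContext / CallbackManager API (import paths differ between llama_index versions):

```
# Prints the per-event timing tree (TEMPLATING, LLM, QUERY, RETRIEVE, SYNTHESIZE)
# when a trace ends, like the log above.
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(callback_manager=callback_manager)
```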
the templating time is the fastest thing listed there πŸ‘€
oh fuck i am retarded
is there a way / best practices to reduce the other times?
i am exploring the best chunk size based on your blog articles
hahaha it happens

The slowest thing here seems to be the LLM, which can't really be sped up.

What does your setup look like? I'm assuming you have a chat engine with something else?
yeah, i will show you the chat engine and query engine creation
```
service_context = get_service_context()    # app helper that builds the ServiceContext
history = retrieve_chat_history(chatId)    # app helper that loads prior chat messages

# wrap the query engine (built from the RecursiveRetriever) in a condense-question chat engine
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    chat_history=history,
    service_context=service_context,
    verbose=True,
)

response = chat_engine.stream_chat(query_text)

return Response(send_and_save_response(response, chatId, query_text), mimetype='application/json')
```
the first file is how i build the RecursiveRetriever; in the second file i take the query engine from the recursive retriever and build a CondenseQuestionChatEngine
i need RecursiveRetriever because, as you suggested some days ago, i need a way to run the same query on different sources (indexes) that may have similar information and take the best output
i use pinecone to store the vector data from the sources
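For reference, the "same query against several sources" setup described above usually looks something like the sketch below with RecursiveRetriever (legacy llama_index API). This is only an illustrative outline, not the poster's actual files: source_indexes, the top_k values and the placeholder node text are hypothetical, and the Pinecone-backed indexes are assumed to be built elsewhere.

```
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

service_context = ServiceContext.from_defaults()

# source_indexes: {"faq": <VectorStoreIndex>, "directions": <VectorStoreIndex>, ...}
# one IndexNode per source; its index_id points at the retriever for that source
index_nodes = [
    IndexNode(text=f"Questions about {name}", index_id=name)
    for name in source_indexes
]
root_index = VectorStoreIndex(index_nodes, service_context=service_context)

retriever_dict = {"root": root_index.as_retriever(similarity_top_k=2)}
for name, index in source_indexes.items():
    retriever_dict[name] = index.as_retriever(similarity_top_k=3)

# the root retriever picks the most relevant source(s); RecursiveRetriever then
# follows each IndexNode into the matching per-source retriever
recursive_retriever = RecursiveRetriever("root", retriever_dict=retriever_dict, verbose=True)
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, service_context=service_context
)
```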
So the condense chat engine will use one LLM call to write a new query based on the chat history

Then, there will be at least one more LLM call to actually query the index

So, I don't see an easy way to improve latency from there πŸ€” Using streaming always helps make things a little faster though
yeah i already use the streaming system
(i don't care if the endpoint takes 20 seconds, i just want the user to start seeing output after 3-4 seconds; right now it takes 7)
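As an aside, the streaming pattern being discussed here boils down to yielding tokens to the client as soon as the LLM produces them, which is what keeps time-to-first-token low even when the full answer takes much longer. A minimal sketch, assuming the legacy StreamingAgentChatResponse.response_gen generator; save_response is a hypothetical stand-in for the send_and_save_response helper in the snippet above:

```
from flask import Response


def save_response(chat_id, query_text, full_text):
    # hypothetical persistence hook; in this thread it is send_and_save_response
    pass


def stream_and_save(chat_engine, query_text, chat_id):
    streaming_response = chat_engine.stream_chat(query_text)

    def generate():
        chunks = []
        # response_gen yields token strings as soon as the LLM produces them
        for token in streaming_response.response_gen:
            chunks.append(token)
            yield token
        # persist the full text once streaming is done
        save_response(chat_id, query_text, "".join(chunks))

    return Response(generate(), mimetype="text/plain")
```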
np and really thanks for the help i will try different chunk sizes!
You could also try and use the callbacks to pull intermediate information, like the LLM re-phrasing the input. Just to give the user something to read haha
a little complex to set up though, you'd have to write a custom callback
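For anyone following along, a custom callback along those lines could look roughly like the sketch below: a handler that reacts to LLM events and pushes the intermediate output (for example the condensed question) somewhere the UI can display it. Import paths and payload keys vary between llama_index versions, so treat this as a starting point rather than a drop-in solution.

```
from typing import Any, Dict, Optional

from llama_index.callbacks import CallbackManager, CBEventType
from llama_index.callbacks.base_handler import BaseCallbackHandler


class IntermediateStepHandler(BaseCallbackHandler):
    """Surfaces intermediate LLM output while the final answer is still running."""

    def __init__(self) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])

    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        parent_id: str = "",
        **kwargs: Any,
    ) -> str:
        return event_id

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        # when an LLM event finishes (e.g. the condensed question), push its payload
        # somewhere the frontend can read it (queue, websocket, SSE, ...)
        if event_type == CBEventType.LLM and payload is not None:
            print("intermediate LLM output:", payload)  # replace with a real push

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, Any]] = None,
    ) -> None:
        pass


# attach it via the callback manager, e.g. on ServiceContext.from_defaults(callback_manager=...)
callback_manager = CallbackManager([IntermediateStepHandler()])
```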
thanks for the info Logan have a nice day