```
Querying with: Come posso arrivare alla fiera?
Oct 11 08:32:40 messe-rag-chatbot app/web.1 **********
Oct 11 08:32:40 messe-rag-chatbot app/web.1 Trace: chat
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.TEMPLATING ->  4.4e-05 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.LLM ->  2.075113 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.QUERY ->  2.032007 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1       |_CBEventType.RETRIEVE ->  2.027069 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1       |_CBEventType.SYNTHESIZE ->  0.004779 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1         |_CBEventType.TEMPLATING ->  3.3e-05 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1         |_CBEventType.LLM ->  0.0 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1     |_CBEventType.LLM ->  0.0 seconds
Oct 11 08:32:40 messe-rag-chatbot app/web.1 **********
```
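For reference, a trace like the one above is what LlamaDebugHandler prints at the end of each query when print_trace_on_end is enabled. A minimal sketch of wiring it up, assuming the legacy ServiceContext / CallbackManager API (import paths differ between llama_index versions):

```
# Prints the per-event timing tree (TEMPLATING, LLM, QUERY, RETRIEVE, SYNTHESIZE)
# when a trace ends, like the log above.
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(callback_manager=callback_manager)
```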
the templating time is the fastest thing listed there πŸ‘€
oh fuck i am retarded
is there a way / best practices to reduce the other times?
i am exploring the best chunk size based on your blog articles
hahaha it happens

The slowest thing here seems to be the LLM, which can't really be sped up.

What does your setup look like? I'm assuming you have a chat engine with something else?
yeah, i will show you the chat engine and query engine creation
```
service_context = get_service_context()    # app helper that builds the ServiceContext
history = retrieve_chat_history(chatId)    # app helper that loads prior chat messages

# wrap the query engine (built from the RecursiveRetriever) in a condense-question chat engine
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    condense_question_prompt=custom_prompt,
    chat_history=history,
    service_context=service_context,
    verbose=True,
)

response = chat_engine.stream_chat(query_text)

return Response(send_and_save_response(response, chatId, query_text), mimetype='application/json')
```
the first file is how i build the RecursiveRetriever; in the second file i take the query engine from the recursive retriever and build a CondenseQuestionChatEngine
i need RecursiveRetriever because, as you suggested some days ago, i need a way to run the same query on different sources (indexes) that may have similar information and take the best output
i use pinecone to store the vector data from the sources
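For reference, the "same query against several sources" setup described above usually looks something like the sketch below with RecursiveRetriever (legacy llama_index API). This is only an illustrative outline, not the poster's actual files: source_indexes, the top_k values and the placeholder node text are hypothetical, and the Pinecone-backed indexes are assumed to be built elsewhere.

```
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

service_context = ServiceContext.from_defaults()

# source_indexes: {"faq": <VectorStoreIndex>, "directions": <VectorStoreIndex>, ...}
# one IndexNode per source; its index_id points at the retriever for that source
index_nodes = [
    IndexNode(text=f"Questions about {name}", index_id=name)
    for name in source_indexes
]
root_index = VectorStoreIndex(index_nodes, service_context=service_context)

retriever_dict = {"root": root_index.as_retriever(similarity_top_k=2)}
for name, index in source_indexes.items():
    retriever_dict[name] = index.as_retriever(similarity_top_k=3)

# the root retriever picks the most relevant source(s); RecursiveRetriever then
# follows each IndexNode into the matching per-source retriever
recursive_retriever = RecursiveRetriever("root", retriever_dict=retriever_dict, verbose=True)
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, service_context=service_context
)
```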
So the condense chat engine will use one LLM call to write a new query based on the chat history

Then, there will be at least one more LLM call to actually query the index

So, I don't see an easy way to improve latency from there πŸ€” Using streaming always helps make things a little faster though
yeah i already use the streaming system
(i don't care if the endpoint takes 20 seconds, i just want the user to start seeing output after 3-4 seconds; right now it takes 7)
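As an aside, the streaming pattern being discussed here boils down to yielding tokens to the client as soon as the LLM produces them, which is what keeps time-to-first-token low even when the full answer takes much longer. A minimal sketch, assuming the legacy StreamingAgentChatResponse.response_gen generator; save_response is a hypothetical stand-in for the send_and_save_response helper in the snippet above:

```
from flask import Response


def save_response(chat_id, query_text, full_text):
    # hypothetical persistence hook; in this thread it is send_and_save_response
    pass


def stream_and_save(chat_engine, query_text, chat_id):
    streaming_response = chat_engine.stream_chat(query_text)

    def generate():
        chunks = []
        # response_gen yields token strings as soon as the LLM produces them
        for token in streaming_response.response_gen:
            chunks.append(token)
            yield token
        # persist the full text once streaming is done
        save_response(chat_id, query_text, "".join(chunks))

    return Response(generate(), mimetype="text/plain")
```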
np and really thanks for the help i will try different chunk sizes!
You could also try and use the callbacks to pull intermediate information, like the LLM re-phrasing the input. Just to give the user something to read haha
a little complex to set up though, you'd have to write a custom callback
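For anyone following along, a custom callback along those lines could look roughly like the sketch below: a handler that reacts to LLM events and pushes the intermediate output (for example the condensed question) somewhere the UI can display it. Import paths and payload keys vary between llama_index versions, so treat this as a starting point rather than a drop-in solution.

```
from typing import Any, Dict, Optional

from llama_index.callbacks import CallbackManager, CBEventType
from llama_index.callbacks.base_handler import BaseCallbackHandler


class IntermediateStepHandler(BaseCallbackHandler):
    """Surfaces intermediate LLM output while the final answer is still running."""

    def __init__(self) -> None:
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])

    def on_event_start(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        parent_id: str = "",
        **kwargs: Any,
    ) -> str:
        return event_id

    def on_event_end(
        self,
        event_type: CBEventType,
        payload: Optional[Dict[str, Any]] = None,
        event_id: str = "",
        **kwargs: Any,
    ) -> None:
        # when an LLM event finishes (e.g. the condensed question), push its payload
        # somewhere the frontend can read it (queue, websocket, SSE, ...)
        if event_type == CBEventType.LLM and payload is not None:
            print("intermediate LLM output:", payload)  # replace with a real push

    def start_trace(self, trace_id: Optional[str] = None) -> None:
        pass

    def end_trace(
        self,
        trace_id: Optional[str] = None,
        trace_map: Optional[Dict[str, Any]] = None,
    ) -> None:
        pass


# attach it via the callback manager, e.g. on ServiceContext.from_defaults(callback_manager=...)
callback_manager = CallbackManager([IntermediateStepHandler()])
```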
thanks for the info Logan have a nice day