LLM time

At a glance

The community member has developed a chatbot for their company and is facing performance issues. They are using a recursive retriever to query information from different data sources, and then using a CondenseChatEngine to process the results. The main issue is the slow response time of the first LLM call in the CondenseChatEngine, which takes 2-3 seconds.

The community member has already tried different chunk sizes, limiting the context, and streaming. The other community members note that the number and size of the LLM calls are the main bottlenecks, and that it may not be possible to significantly reduce the latency; one of them points out that a 1.7-second response time for the OpenAI LLM is actually quite fast.

The community member also provides details on the technologies they are using, including Pinecone, MongoDB, and OpenAI's GPT-3.5-turbo model. They mention that they are considering optimizing the service context with a prompt helper, but do not receive any specific hints on how to do this.

Hi guys, I have finished my first company chatbot on custom docs, and it works very well (very nice lib, guys!). I have one question about performance. I attach the current code that builds the query engine and the chat engine. I use a RecursiveRetriever because the information lives in different data sources (yes, it's not ideal, but I can't change the data sources and use something else like a SubQueryEngine), and after that I pick the first 2 results of each index and pass them to the CondenseChatEngine. It all works well, but is there any way to reduce latency? I have tried different chunk sizes, limiting the context, streaming, etc. The problem seems to be the first LLM call of the CondenseChatEngine, which is pretty slow (2-3 seconds). I have tried the other engines, but they give me lower-quality results. Any hint is appreciated 🙂
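
For reference, the engine-construction code mentioned above is not included in the thread; a rough sketch of the described setup (a RecursiveRetriever over several indices whose top-2 results feed a condense-question chat engine), written against the legacy LlamaIndex 0.x API, might look like the following. The data sources, index names, and similarity_top_k values are illustrative assumptions, not the poster's actual code.
Plain Text
from llama_index import Document, VectorStoreIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

# Two hypothetical per-source indices; in the real setup these would be
# loaded from the Pinecone/Mongo-backed storage shown further down.
faq_index = VectorStoreIndex.from_documents(
    [Document(text="The fair can be reached by train, bus, or car ...")]
)
events_index = VectorStoreIndex.from_documents(
    [Document(text="Opening hours, exhibitor list, hall map ...")]
)

# One IndexNode per data source; the recursive retriever follows the
# index_id of whichever nodes the root retriever returns.
root_index = VectorStoreIndex(
    [
        IndexNode(text="FAQ and visitor information", index_id="faq"),
        IndexNode(text="Events and exhibitor information", index_id="events"),
    ]
)

retriever = RecursiveRetriever(
    "root",
    retriever_dict={
        "root": root_index.as_retriever(similarity_top_k=2),
        "faq": faq_index.as_retriever(similarity_top_k=2),
        "events": events_index.as_retriever(similarity_top_k=2),
    },
)

query_engine = RetrieverQueryEngine.from_args(retriever)

# The condense step rewrites the chat history plus the new message into a
# standalone question; that rewrite is the extra first LLM call discussed below.
chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine)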
13 comments
Yea, not really possible to reduce the time I think πŸ€”

there are two main bottlenecks:

  1. The number of LLM calls
  2. The size of each LLM call
Both of those are pretty hard to get around in effective systems πŸ€”
yeah, the only problem is the first LLM call
that slows down the user experience
the condense_question one (the most important, LOL)
I'm surprised that it's slow, because the overall prompt should be quite small
What LLM are you using?
Plain Text
import os

import openai
import pinecone
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores import PineconeVectorStore

# (imports assume the legacy llama_index 0.x API used by the rest of the snippet)

# Connect to Pinecone and OpenAI from environment variables
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"),
              environment=os.getenv("PINECONE_ENV"))
openai.api_key = os.getenv("OPENAI_API_KEY")

pinecone_index = pinecone.Index("messe-chatbot")

# Index metadata lives in MongoDB, vectors live in Pinecone
index_store = MongoIndexStore.from_uri(uri=os.getenv(
    "MONGODB_URI"), db_name=os.getenv("MONGODB_DATABASE"))
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

llm_predictor = LLMPredictor(llm=ChatOpenAI(
    temperature=0, model_name='gpt-3.5-turbo'))

# Debug handler prints a timing trace after every query/chat call
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    num_output=512,
    callback_manager=callback_manager
)
maybe now I'm looking at optimizing the service context with the prompt_helper
do you have any hint on that?
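The prompt_helper question does not get a direct answer below; as a hedged aside, wiring a PromptHelper into the ServiceContext might look roughly like the sketch that follows. It assumes the newer 0.x-style constructor (context_window / num_output / chunk_overlap_ratio); older releases used max_input_size / max_chunk_overlap instead, and the numbers shown are placeholders, not tuned values.
Plain Text
from llama_index import PromptHelper, ServiceContext

# Placeholder numbers, not tuned values.
prompt_helper = PromptHelper(
    context_window=4096,      # gpt-3.5-turbo context size
    num_output=512,           # tokens reserved for the model's answer
    chunk_overlap_ratio=0.1,  # overlap used when context must be split to fit
)

# Pass it alongside llm_predictor / callback_manager in the existing
# ServiceContext.from_defaults(...) call from the snippet above.
service_context = ServiceContext.from_defaults(prompt_helper=prompt_helper)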
here's the trace: Querying with: Come posso raggiungere la fiera? (Italian: "How can I reach the fair?")
**
Trace: chat
|_templating -> 0.000105 seconds
|_llm -> 1.779513 seconds
|_query -> 1.87383 seconds
|_retrieve -> 1.868156 seconds
|_synthesize -> 0.005423 seconds
|_templating -> 2e-05 seconds
|_llm -> 0.0 seconds
|_llm -> 0.0 seconds
**
|_llm -> 1.779513 seconds ---> this is the slow one
I think that's as fast as it gets for now (1.7s is quite fast tbh for openai)
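As a footnote to the trace above: the LlamaDebugHandler already attached to the service context can also aggregate timings per event type, which makes it easier to confirm across several requests that the condense-question rewrite is the dominant LLM cost. A rough sketch, assuming the 0.x callbacks API (method names worth double-checking against the installed version):
Plain Text
from llama_index.callbacks import CBEventType

# `llama_debug` is the LlamaDebugHandler created in the setup snippet above.
# After a few chat_engine.chat(...) calls, print aggregate LLM timing stats
# (total / average seconds and call count).
print(llama_debug.get_event_time_info(CBEventType.LLM))

# Inspect each LLM call's start/end events to see which one is the
# condense-question rewrite and what its payload contains.
for start_event, end_event in llama_debug.get_llm_inputs_outputs():
    print(start_event.time, end_event.payload.keys())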