LLM time

At a glance

The community member has developed a chatbot for their company and is facing performance issues. They are using a recursive retriever to query information from different data sources, and then using a CondenseChatEngine to process the results. The main issue is the slow response time of the first LLM call in the CondenseChatEngine, which takes 2-3 seconds.

The community member has already tried different chunk sizes, limiting the context, and streaming. The other community members note that the number and size of the LLM calls are the main bottlenecks, and that it may not be possible to significantly reduce the latency; one of them points out that a 1.7-second response time for the OpenAI LLM is actually quite fast.

The community member also provides details on the technologies they are using, including Pinecone, MongoDB, and OpenAI's GPT-3.5-turbo model. They mention that they are considering optimizing the service context with a prompt helper, but do not receive any specific hints on how to do this.

Hi guys, I have finished my first company chatbot on custom docs, and it works very well (very nice lib, guys!). I have one question about performance. I attach the current code that builds the query engine and the chat engine. I use a RecursiveRetriever because the information lives in different data sources (yes, it's not ideal, but I can't change the data sources and use something else like a SubQueryEngine), and after that I pick the first 2 results of each index and pass them to the CondenseChatEngine. It all works well, but is there any way to reduce latency? I have tried different chunk sizes, limiting the context, streaming, etc. The problem seems to be the first LLM call of the CondenseChatEngine, which is pretty slow (2-3 seconds). I have tried the other engines, but they give me lower-quality results. Any hint is appreciated 🙂
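
For reference, the engine-construction code mentioned above is not included in the thread; a rough sketch of the described setup (a RecursiveRetriever over several indices whose top-2 results feed a condense-question chat engine), written against the legacy LlamaIndex 0.x API, might look like the following. The data sources, index names, and similarity_top_k values are illustrative assumptions, not the poster's actual code.
Plain Text
from llama_index import Document, VectorStoreIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import RecursiveRetriever
from llama_index.schema import IndexNode

# Two hypothetical per-source indices; in the real setup these would be
# loaded from the Pinecone/Mongo-backed storage shown further down.
faq_index = VectorStoreIndex.from_documents(
    [Document(text="The fair can be reached by train, bus, or car ...")]
)
events_index = VectorStoreIndex.from_documents(
    [Document(text="Opening hours, exhibitor list, hall map ...")]
)

# One IndexNode per data source; the recursive retriever follows the
# index_id of whichever nodes the root retriever returns.
root_index = VectorStoreIndex(
    [
        IndexNode(text="FAQ and visitor information", index_id="faq"),
        IndexNode(text="Events and exhibitor information", index_id="events"),
    ]
)

retriever = RecursiveRetriever(
    "root",
    retriever_dict={
        "root": root_index.as_retriever(similarity_top_k=2),
        "faq": faq_index.as_retriever(similarity_top_k=2),
        "events": events_index.as_retriever(similarity_top_k=2),
    },
)

query_engine = RetrieverQueryEngine.from_args(retriever)

# The condense step rewrites the chat history plus the new message into a
# standalone question; that rewrite is the extra first LLM call discussed below.
chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine)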
13 comments
Yea, not really possible to reduce the time I think πŸ€”

there are two main bottlenecks:

  1. The number of LLM calls
  2. The size of each LLM call
Both of those are pretty hard to get around in effective systems πŸ€”
yeah, the only problem is the first LLM call
that slows down the user experience
the condense_question one (the most important, LOL)
I'm surprised that it's slow, because the overall prompt should be quite small
What LLM are you using?
Plain Text
import os

import openai
import pinecone
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores import PineconeVectorStore

# (imports assume the legacy llama_index 0.x API used by the rest of the snippet)

# Connect to Pinecone and OpenAI from environment variables
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"),
              environment=os.getenv("PINECONE_ENV"))
openai.api_key = os.getenv("OPENAI_API_KEY")

pinecone_index = pinecone.Index("messe-chatbot")

# Index metadata lives in MongoDB, vectors live in Pinecone
index_store = MongoIndexStore.from_uri(uri=os.getenv(
    "MONGODB_URI"), db_name=os.getenv("MONGODB_DATABASE"))
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

llm_predictor = LLMPredictor(llm=ChatOpenAI(
    temperature=0, model_name='gpt-3.5-turbo'))

# Debug handler prints a timing trace after every query/chat call
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    num_output=512,
    callback_manager=callback_manager
)
maybe now I'm looking at optimizing the service context with the prompt_helper
do you have any hint on that?
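The prompt_helper question does not get a direct answer below; as a hedged aside, wiring a PromptHelper into the ServiceContext might look roughly like the sketch that follows. It assumes the newer 0.x-style constructor (context_window / num_output / chunk_overlap_ratio); older releases used max_input_size / max_chunk_overlap instead, and the numbers shown are placeholders, not tuned values.
Plain Text
from llama_index import PromptHelper, ServiceContext

# Placeholder numbers, not tuned values.
prompt_helper = PromptHelper(
    context_window=4096,      # gpt-3.5-turbo context size
    num_output=512,           # tokens reserved for the model's answer
    chunk_overlap_ratio=0.1,  # overlap used when context must be split to fit
)

# Pass it alongside llm_predictor / callback_manager in the existing
# ServiceContext.from_defaults(...) call from the snippet above.
service_context = ServiceContext.from_defaults(prompt_helper=prompt_helper)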
here's the trace: Querying with: Come posso raggiungere la fiera? (Italian: "How can I reach the fair?")
**
Trace: chat
|_templating -> 0.000105 seconds
|_llm -> 1.779513 seconds
|_query -> 1.87383 seconds
|_retrieve -> 1.868156 seconds
|_synthesize -> 0.005423 seconds
|_templating -> 2e-05 seconds
|_llm -> 0.0 seconds
|_llm -> 0.0 seconds
**
|_llm -> 1.779513 seconds ---> this is the slow one
I think that's as fast as it gets for now (1.7s is quite fast tbh for openai)
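As a footnote to the trace above: the LlamaDebugHandler already attached to the service context can also aggregate timings per event type, which makes it easier to confirm across several requests that the condense-question rewrite is the dominant LLM cost. A rough sketch, assuming the 0.x callbacks API (method names worth double-checking against the installed version):
Plain Text
from llama_index.callbacks import CBEventType

# `llama_debug` is the LlamaDebugHandler created in the setup snippet above.
# After a few chat_engine.chat(...) calls, print aggregate LLM timing stats
# (total / average seconds and call count).
print(llama_debug.get_event_time_info(CBEventType.LLM))

# Inspect each LLM call's start/end events to see which one is the
# condense-question rewrite and what its payload contains.
for start_event, end_event in llama_debug.get_llm_inputs_outputs():
    print(start_event.time, end_event.payload.keys())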