LLM time

hi guys, I have finished my first company chatbot on custom docs, and it works very well (very nice lib, guys!). I have one question about performance. I attach the current code that builds the query engine and the chat engine. I use RecursiveRetriever because the information lives in different data sources (yes, it's pretty ugly, but I can't change the data sources and use something else like SubQuestionQueryEngine), and after that I pick the first 2 results from each index and use those results in the CondenseQuestionChatEngine. It all works well, but is there any way to reduce latency? I have tried different chunk sizes, limiting the context, etc., and streaming. The problem seems to be the first LLM call of the CondenseQuestionChatEngine, which is pretty slow (2-3 seconds), so I tried the other engines, but they give me lower-quality results. Any hint is appreciated 🙂
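The setup described above could be wired up roughly like this. This is only a minimal sketch against the legacy llama_index API; `source_indices` (a dict of already-built per-source vector indices) and `service_context` (configured in the snippet later in this thread) are assumed names, not code from the thread:

Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import IndexNode
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.chat_engine import CondenseQuestionChatEngine

# One IndexNode per data source; the root index routes queries to them.
index_nodes = [
    IndexNode(text=f"Information stored in the '{name}' source", index_id=name)
    for name in source_indices
]
root_index = VectorStoreIndex(index_nodes, service_context=service_context)

# Take the top 2 results from the root and from each underlying index.
retriever_dict = {"root": root_index.as_retriever(similarity_top_k=2)}
for name, index in source_indices.items():
    retriever_dict[name] = index.as_retriever(similarity_top_k=2)

recursive_retriever = RecursiveRetriever("root", retriever_dict=retriever_dict)

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    service_context=service_context,
    streaming=True,  # stream the synthesized answer to hide part of the latency
)

# The condense step rewrites chat history + new message into a standalone
# question with one extra LLM call -- the call discussed below.
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    service_context=service_context,
    verbose=True,
)

Calling `chat_engine.stream_chat(message)` should then return a streaming response whose tokens can be forwarded to the user as they arrive.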
Yeah, not really possible to reduce the time, I think 🤔

There are two main bottlenecks:

  1. The number of LLM calls
  2. The size of each LLM call
Both of those are pretty hard to get around in effective systems 🤔
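To see where those calls go, the LlamaDebugHandler that is already wired into the service context (shown further down in this thread) can break out per-event timings and the exact inputs/outputs of each LLM call. A small sketch, assuming a `chat_engine` like the one sketched above; `get_event_time_info` and `get_llm_inputs_outputs` are from the legacy llama_index callback API and may differ in newer releases:

Plain Text
from llama_index.callbacks import CBEventType

# llama_debug is the LlamaDebugHandler instance registered in the
# CallbackManager of the ServiceContext (see the snippet below).
response = chat_engine.chat("Come posso raggiungere la fiera?")

# Aggregate stats per event type: shows whether LLM calls or retrieval dominate.
print(llama_debug.get_event_time_info(CBEventType.LLM))
print(llama_debug.get_event_time_info(CBEventType.RETRIEVE))

# Exact input/output of every LLM call, e.g. to check how large the
# condense-question prompt really is.
for start_event, end_event in llama_debug.get_llm_inputs_outputs():
    print(start_event.payload)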
yeah, the only problem is the first LLM call
that slows down the user experience
the condense_question one (the most important, LOL)
I'm surprised that it's slow, because the overall prompt should be quite small
What LLM are you using?
Plain Text
import os

import openai
import pinecone
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores import PineconeVectorStore

# Connect to Pinecone and OpenAI using environment variables
pinecone.init(api_key=os.getenv("PINECONE_API_KEY"),
              environment=os.getenv("PINECONE_ENV"))
openai.api_key = os.getenv("OPENAI_API_KEY")

pinecone_index = pinecone.Index("messe-chatbot")

# Index metadata lives in MongoDB, vectors live in Pinecone
index_store = MongoIndexStore.from_uri(uri=os.getenv(
    "MONGODB_URI"), db_name=os.getenv("MONGODB_DATABASE"))
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

llm_predictor = LLMPredictor(llm=ChatOpenAI(
    temperature=0, model_name='gpt-3.5-turbo'))

# Debug handler prints a trace of every event (templating, LLM, retrieve, ...)
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
callback_manager = CallbackManager([llama_debug])

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    num_output=512,
    callback_manager=callback_manager
)
maybe now I'm looking at optimizing the service context with ---> prompt_helper
do you have any hint on that?
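For reference, a minimal sketch of what a prompt_helper tweak could look like; the keyword arguments (context_window, num_output, chunk_overlap_ratio, chunk_size_limit) are from the legacy PromptHelper API, and the values here are placeholders to illustrate the knobs, not recommendations:

Plain Text
from llama_index import PromptHelper, ServiceContext

# Tighter limits shrink each prompt (and therefore each LLM call),
# at the cost of how much retrieved context fits into it.
prompt_helper = PromptHelper(
    context_window=4096,      # context size of gpt-3.5-turbo
    num_output=512,           # tokens reserved for the answer
    chunk_overlap_ratio=0.1,  # overlap when chunks are repacked into the prompt
    chunk_size_limit=1024,    # cap on each chunk placed in the prompt
)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,        # from the snippet above
    prompt_helper=prompt_helper,
    callback_manager=callback_manager,  # from the snippet above
)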
here's the trace: Querying with: Come posso raggiungere la fiera? (Italian for "How can I get to the fair?")
**********
Trace: chat
    |_templating -> 0.000105 seconds
    |_llm -> 1.779513 seconds
    |_query -> 1.87383 seconds
      |_retrieve -> 1.868156 seconds
      |_synthesize -> 0.005423 seconds
        |_templating -> 2e-05 seconds
        |_llm -> 0.0 seconds
    |_llm -> 0.0 seconds
**********
|_llm -> 1.779513 seconds ---> this is the slow one
I think that's as fast as it gets for now (1.7s is quite fast tbh for OpenAI)