Debug

I am using a vector index as an OpenAI chat engine, but the problem is that it is very slow. Sometimes it takes more than 15 seconds. My index is less than 15 MB, shouldn't it be very fast at such a small size? I know that the GPT response itself is slow and there's nothing I can really do to speed it up, but from my timing it accounts for at most half of the total time on average, and the rest comes from the querying.
You could check which step is taking the most time and inspect that.

For better logs you could set:
Plain Text
import llama_index

llama_index.set_global_handler("simple")

This will tell you how much time is spent at each step.
It doesn't really show the execution time, it only shows some extra logs. But I am doing the timing myself.

1.93 sec for the initial agent_chat_response = self._get_agent_response(mode=mode, **llm_chat_kwargs)

Afterwards we have
Plain Text
=== Calling Function ===
Calling function: query_engine_tool with args: {
  "input": "some question..."
}
Got output: ....

which took 7.155 secs

And finally 6.54 sec for the agent_response with current_func='auto', which I assume is the GPT response itself, am I correct?
I can't speed that up, but the other half of the time is the index query, which can be sped up.
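
A minimal sketch of that kind of manual timing (assuming a chat_engine object built from the index; the names are placeholders, not the actual code):
Plain Text
import time

# Wrap the chat call to measure end-to-end latency for one step
start = time.perf_counter()
response = chat_engine.chat("some question...")
print(f"chat took {time.perf_counter() - start:.2f} sec")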


I also get the feeling that things got worse after I updated to the latest version of llama-index.
How many nodes do you have in your index?
And what is the chunk size?
IMO if the chunk size is very small it increases the time needed to find the most similar records, as a larger number of nodes has to be checked.
I have 250 documents, each one is sort of small, a couple of paragraphs. I haven't messed with chunk size. What is the default?
Default is 1024
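If you do want to change it, the chunk size can be set on the service context; a sketch, assuming the pre-0.10 ServiceContext API used elsewhere in this thread (documents is a placeholder for your loaded documents):
Plain Text
from llama_index import ServiceContext, VectorStoreIndex

# Larger chunks -> fewer nodes to embed and compare at query time
service_context = ServiceContext.from_defaults(chunk_size=1024)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)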
Ok, I'll try to increase it but shouldn't such a small index work very fast?
Nah default is fine in your case as your documents are already small.
You could try adding metadata, like important information, to each record.
I also notice that it is considerably slower on a Google Cloud instance that I'm running than on my local machine. Does using an SSD speed things up a lot?
Yeah, if you are building embeddings locally then an SSD is better.
How would I go about adding the metadata? I think I have to add key-value pairs to the beginning of the text? Right now my documents are all in the format
{title}, {content}

if I do something like
title: {title}

{title}, {content}

Would it consider the title as metadata and search faster using it?
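
For reference, a minimal sketch of attaching the title as document metadata instead of only prepending it to the text (assuming the documents are built manually; title and content are placeholder variables):
Plain Text
from llama_index import Document

doc = Document(
    text=content,
    # Stored alongside the text; visible to the retriever and the LLM
    metadata={"title": title},
)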
So your data is structured? Like, does it have a good title and a good description?
I think for most of the documents yes.
I think in that case you'll have to check whether it is really the retrieval portion that's the issue. Try this once and see if it helps you identify the root cause: https://docs.llamaindex.ai/en/stable/examples/callbacks/LlamaDebugHandler.html#llama-debug-handler
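The setup from that page boils down to roughly this (a sketch, again assuming the pre-0.10 ServiceContext API):
Plain Text
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

# Prints a timing trace for each top-level operation when it finishes
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([llama_debug])
)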
This is from a simple query such as "Hello":
Plain Text
**********
Trace: chat
    |_agent_step ->  7.441084 seconds
      |_llm ->  2.736743 seconds
      |_function_call ->  1.853167 seconds
      |_llm ->  2.847154 seconds
**********


This is from a meaningful question regarding the information in the index:
Plain Text
**********
Trace: chat
    |_agent_step ->  27.780369 seconds
      |_llm ->  2.179713 seconds
      |_function_call ->  13.60436 seconds
      |_llm ->  11.994729 seconds
**********
The function_call is the index querying, right? It's a very big portion of the time waited.
By the way, the way I am creating my index is:
Plain Text
from llama_index import LLMPredictor, StorageContext, load_index_from_storage
from langchain.chat_models import ChatOpenAI

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
storage_context = StorageContext.from_defaults(persist_dir=index_dir)
index = load_index_from_storage(storage_context, llm_predictor=llm_predictor)

Do I need the llm predictor if I am using the OpenAI chat engine? Could it be that it is using GPT both for retrieving and for the actual response?
Yes, that would be the index query part. And ideally it should not take this much time πŸ‘€
LlamaIndex has a better wrapper for OpenAI, you can directly pass the llm now:
https://gpt-index.readthedocs.io/en/latest/examples/chat_engine/chat_engine_openai.html
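Roughly, following that page, it looks like this (a sketch reusing the storage_context from the snippet above):
Plain Text
from llama_index import ServiceContext, load_index_from_storage
from llama_index.llms import OpenAI

# Pass LlamaIndex's own OpenAI wrapper directly, no LLMPredictor needed
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)
index = load_index_from_storage(storage_context, service_context=service_context)

chat_engine = index.as_chat_engine(chat_mode="openai", verbose=True)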
I would suggest you use a debugger to check the code flow; that will help you understand the issue much better.
I debugged a bit but I'm still not sure what to do. The query response just takes a lot of time, and after that the actual GPT response takes around the same time. Could it be that GPT is also being used in the retrieval?
From the debugging I saw that on each of the three steps a request is made.
I checked the code; I think it runs until the max_function_calls value is reached, and in your case that could be what is increasing your time.
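If that turns out to be the cause, the limit can be lowered when the agent is built directly; a sketch assuming an OpenAIAgent with a query_engine_tool and an llm defined as above:
Plain Text
from llama_index.agent import OpenAIAgent

# Cap how many tool/function calls the agent may make per user message
agent = OpenAIAgent.from_tools(
    [query_engine_tool],
    llm=llm,
    max_function_calls=2,
    verbose=True,
)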
Could you share your query again so that @Logan M does not have to go through the entire conversation πŸ˜…
I think the OpenAI chat engine just works like this. I switched to the context chat engine and now it seems to work faster.
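That switch is roughly a one-line change (a sketch, using the same index as above):
Plain Text
# "context" retrieves from the index first and puts the results into the
# system prompt, instead of letting an agent decide when to call the query tool
chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.chat("some question...")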