Debug

I am using a vector index as an OpenAI chat engine, but the problem is that it is very slow. Sometimes it takes more than 15 seconds. My index is less than 15 MB, shouldn't it be very fast at such a small size? I know that the GPT response itself is slow and there's nothing I can really do to speed it up, but from my timing it accounts for at most half of the total time on average, and the rest comes from the querying.
You could check which step is taking the most time and inspect that.

For better logs you could set:
Plain Text
import llama_index

llama_index.set_global_handler("simple")

This will tell you how much time is spent at each step.
It doesn't really show the execution time, it only shows some extra logs. But I am doing the timing myself.

1.93 sec for the initial agent_chat_response = self._get_agent_response(mode=mode, **llm_chat_kwargs)

Afterwards we have
Plain Text
=== Calling Function ===
Calling function: query_engine_tool with args: {
  "input": "some question..."
}
Got output: ....

which took 7.155 secs

And finally 6.54 sec for the agent_response with current_func='auto', which I assume is the GPT response itself, am I correct?
I can't speed that up, but the other half of the time is the index query, which can be sped up.
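
A minimal sketch of that kind of manual timing (assuming a chat_engine object built from the index; the names are placeholders, not the actual code):
Plain Text
import time

# Wrap the chat call to measure end-to-end latency for one step
start = time.perf_counter()
response = chat_engine.chat("some question...")
print(f"chat took {time.perf_counter() - start:.2f} sec")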


I also get the feeling that things got worse after I updated to the latest version of llama-index.
How many nodes do you have in your index?
And what is the chunk size?
IMO if the chunk size is very small it increases the time needed to find the most similar records, as a larger number of nodes has to be checked.
I have 250 documents, each one is sort of small, a couple of paragraphs. I haven't messed with chunk size. What is the default?
Default is 1024
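If you do want to change it, the chunk size can be set on the service context; a sketch, assuming the pre-0.10 ServiceContext API used elsewhere in this thread (documents is a placeholder for your loaded documents):
Plain Text
from llama_index import ServiceContext, VectorStoreIndex

# Larger chunks -> fewer nodes to embed and compare at query time
service_context = ServiceContext.from_defaults(chunk_size=1024)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)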
Ok, I'll try to increase it but shouldn't such a small index work very fast?
Nah default is fine in your case as your documents are already small.
You could try adding metadata, like important information, to each record.
I also notice that it is considerably slower on a Google Cloud instance that I'm running than on my local machine. Does using an SSD speed things up a lot?
Yeah, if you are building embeddings locally then an SSD is better.
How would I go about adding the metadata? I think I have to add key-value pairs to the beginning of the text? Right now my documents are all in the format
{title}, {content}

if I do something like
title: {title}

{title}, {content}

Would it consider the title as metadata and search faster using it?
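
For reference, a minimal sketch of attaching the title as document metadata instead of only prepending it to the text (assuming the documents are built manually; title and content are placeholder variables):
Plain Text
from llama_index import Document

doc = Document(
    text=content,
    # Stored alongside the text; visible to the retriever and the LLM
    metadata={"title": title},
)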
So your data is structured? Like, does it have a good title and a good description?
I think for most of the documents yes.
I think in that case you'll have to check whether it is really the retrieval portion that's the issue. Try this once and see if it helps you identify the root cause: https://docs.llamaindex.ai/en/stable/examples/callbacks/LlamaDebugHandler.html#llama-debug-handler
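The setup from that page boils down to roughly this (a sketch, again assuming the pre-0.10 ServiceContext API):
Plain Text
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, LlamaDebugHandler

# Prints a timing trace for each top-level operation when it finishes
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([llama_debug])
)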
This is from a simple query such as "Hello":
Plain Text
**********
Trace: chat
    |_agent_step ->  7.441084 seconds
      |_llm ->  2.736743 seconds
      |_function_call ->  1.853167 seconds
      |_llm ->  2.847154 seconds
**********


This is from a meaningful question regarding the information in the index:
Plain Text
**********
Trace: chat
    |_agent_step ->  27.780369 seconds
      |_llm ->  2.179713 seconds
      |_function_call ->  13.60436 seconds
      |_llm ->  11.994729 seconds
**********
The function_call is the index querying, right? It's a very big portion of the time waited.
By the way, the way I am creating my index is:
Plain Text
from llama_index import LLMPredictor, StorageContext, load_index_from_storage
from langchain.chat_models import ChatOpenAI

llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
storage_context = StorageContext.from_defaults(persist_dir=index_dir)
index = load_index_from_storage(storage_context, llm_predictor=llm_predictor)

Do I need the llm predictor if I am using the OpenAI chat engine? Could it be that it is using GPT both for retrieving and for the actual response?
Yes, that would be the index query part. And ideally it should not take this much time πŸ‘€
LlamaIndex has a better wrapper for OpenAI, you can directly pass the llm now:
https://gpt-index.readthedocs.io/en/latest/examples/chat_engine/chat_engine_openai.html
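Roughly, following that page, it looks like this (a sketch reusing the storage_context from the snippet above):
Plain Text
from llama_index import ServiceContext, load_index_from_storage
from llama_index.llms import OpenAI

# Pass LlamaIndex's own OpenAI wrapper directly, no LLMPredictor needed
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)
index = load_index_from_storage(storage_context, service_context=service_context)

chat_engine = index.as_chat_engine(chat_mode="openai", verbose=True)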
I would suggest you use a debugger to check the code flow; that will help you understand the issue much better.
I debugged a bit but I'm still not sure what to do. The query response just takes a lot of time, and after that the actual GPT response takes around the same time. Could it be that GPT is also being used in the retrieval?
From the debugging I saw that on each of the three steps a request is made.
I checked the code; I think it runs until the max_function_calls value is reached, and in your case that could be what is increasing your time.
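If that turns out to be the cause, the limit can be lowered when the agent is built directly; a sketch assuming an OpenAIAgent with a query_engine_tool and an llm defined as above:
Plain Text
from llama_index.agent import OpenAIAgent

# Cap how many tool/function calls the agent may make per user message
agent = OpenAIAgent.from_tools(
    [query_engine_tool],
    llm=llm,
    max_function_calls=2,
    verbose=True,
)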
Could you share your query again so that @Logan M does not have to go through the entire conversation πŸ˜…
I think the OpenAI chat engine just works like this. I switched to the context chat engine and now it seems to work faster.
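That switch is roughly a one-line change (a sketch, using the same index as above):
Plain Text
# "context" retrieves from the index first and puts the results into the
# system prompt, instead of letting an agent decide when to call the query tool
chat_engine = index.as_chat_engine(chat_mode="context")
response = chat_engine.chat("some question...")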