Optimizing Response Time for CondensePlusContextChatEngine

Hi @all.
I am using CondensePlusContextChatEngine for chatting, but I am facing an issue where the response time is too long, averaging over 12 seconds to return an answer. How can I optimize this response time?
CondensePlusContextChatEngine produces a response in two steps:
  • It condenses the question based on the user query and the previous chat history.
  • It fetches nodes and then generates an answer based on the fetched nodes and the condensed query.
If you are using open-source LLMs, it may be taking long because each of these two steps makes an LLM call.
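For reference, a rough sketch of where those two calls sit (assuming a retriever and llm you have already built); skipping the condense step drops one of the LLM calls, at the cost of follow-up questions not being rewritten:

from llama_index.core.chat_engine import CondensePlusContextChatEngine

# Call 1 (condense/rewrite the question) is skipped below; call 2 (answer over
# the retrieved nodes) always runs, so at least one LLM call per turn remains.
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever,           # your existing retriever
    llm=llm,
    skip_condense=True,  # skip the question-rewriting LLM call
)
streaming_response = chat_engine.stream_chat("your question here")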
Yes, but I find it a bit long. How can we improve the speed while keeping the answers the same? I have looked into this, and one solution is to reduce the number of nodes, but sometimes reducing nodes does not give the desired results.
The main bottleneck is the LLM calls. There is no way to speed that up except using either a faster LLM or prompting the LLM to write shorter responses.

OpenAI, Anthropic, etc. run very fast

Local LLMs will depend on your hardware
I'm currently using gpt-4o-mini, but it sometimes takes nearly 10 seconds for stream_chat to respond, maybe because my documents are large. Is there any other way to optimize it?
@Logan M @WhiteFang_Jr

I tried FaissVectorStore and saw a speed improvement, but when I use multiple indexes via QueryFusionRetriever and CondensePlusContextChatEngine, I often get errors like '2', '6', '141' when calling chat_engine.stream_chat(input_text). I don't understand where I went wrong. When I don't use FaissVectorStore, I still create several indexes and everything works.

doc: https://docs.llamaindex.ai/en/stable/examples/vector_stores/FaissIndexDemo/
Maybe something went haywire during indexing, I guess. Could you try creating a new index and then try again?
I would suggest using one of the observability tools; that way we can identify exactly which step is causing this delay.

Also, if you have a large number of records (like, really large), keeping them in memory will make the process a bit slow unless you have a GPU.

You can try Qdrant; I feel it's much better than Faiss.
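For the observability part, a minimal sketch using the built-in "simple" global handler (assuming the current llama_index.core API); it prints each LLM call's inputs and outputs, which makes it easy to see how many calls a single chat turn is making:

import llama_index.core

# Print each LLM prompt/completion as it happens, so you can count the calls
# behind one chat turn and spot the slow step.
llama_index.core.set_global_handler("simple")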
I tried it and got the same result. I can't figure out what the cause is yet.

Create Index:

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

# faiss_index is created beforehand, e.g. faiss.IndexFlatL2(embedding_dimension)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# service_context = ServiceContext.from_defaults(llm=llm1, chunk_size=512, chunk_overlap=50)

docs = []
doc = SimpleDirectoryReader(
    input_files=[f"{Constants.docs_path}/{directory_path}/{name_file}"]
).load_data()

doc[0].doc_id = name_file
docs.extend(doc)

# create index
index = VectorStoreIndex.from_documents(
    doc,
    # service_context=service_context,
    storage_context=storage_context,
)

# Directory in which the indexes will be stored
index.storage_context.persist(persist_dir=f"{Constants.index_path}/{directory_path}")
I am not using a GPU at the moment.
Is there any fee to use Qdrant currently?
Yes, Qdrant Cloud gives 1 GB of space free for lifetime.
Try Qdrant once and add an observability tool to get details on the time taken by each step in between. That will help identify the step that is causing the delay.
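If you want to try it, here is a rough sketch of pointing the index at Qdrant Cloud instead of Faiss (the cluster URL, API key, and collection name below are placeholders):

import qdrant_client
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Placeholders: use your own Qdrant Cloud cluster URL and API key.
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER.qdrant.io",
    api_key="YOUR_API_KEY",
)
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Index once; later sessions can attach to the same collection with
# VectorStoreIndex.from_vector_store(vector_store) instead of reloading files.
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)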
Thanks, I will try. But can you check what is wrong with my error above? I have sent you my code, and below is what I have:

pull_path = f"{directory_path}/{folder}"
vector_store = FaissVectorStore.from_persist_dir(f"{pull_path}")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    persist_dir=pull_path
)
index = load_index_from_storage(storage_context)
vector_retriever = index.as_retriever(similarity_top_k=6)
return vector_retriever, index.docstore.docs.values()

///QueryFusionRetriever////
retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=result,  # Reduce the number of generated queries
    # mode="simple",
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)
Are you creating a retriever for each document?
That's right.
Because I think we should first choose some retrievers suited to the question and then query.
Have you tried creating a single retriever for all the documents and checking the accuracy of that?
I have a use case in my org with around 1K+ documents behind a single retriever, and it works quite well.
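A rough sketch of that setup, assuming all the files sit under one directory (the paths reuse the Constants from your snippet):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# One index and one retriever over every document, instead of one per file.
docs = SimpleDirectoryReader(f"{Constants.docs_path}/{directory_path}").load_data()
index = VectorStoreIndex.from_documents(docs)
retriever = index.as_retriever(similarity_top_k=6)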
I tried merging all the indexes, but querying was still slower than splitting into multiple indexes.
I think using a single retriever for everything would be similar. I am having the same problem I asked about at the beginning: only 5 documents and it is already this slow.
Please check with the observability tool once.
With a single retriever, this part is removed, right?

///QueryFusionRetriever////
retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=result,  # Reduce the number of generated queries
    # mode="simple",
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)
I don't understand what you mean.
Currently I am using QueryFusionRetriever.
Okay, and with that I'm assuming the value in num_queries=result (# Reduce the number of generated queries) is never 0, right?

What happens here is that the LLM generates the number of queries specified there. Try setting it to 0.
I think this will reduce the time for you.
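For reference, a hedged sketch of that change; in recent LlamaIndex versions num_queries counts the original query too, so a value of 1 or lower skips the extra query-generation LLM call:

retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=1,  # only the original query; no LLM call to generate variants
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)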
ok i will try
It reduces response time but only partially.
@WhiteFang_Jr I want to ask you another question: when I load the index with load_index_from_storage, is there a way to cache it so that it loads faster the next time?
Yes, you would need to use a vector DB like Qdrant.
Have you used an observability tool? Try that once.
I have never used it. Let me find out.
@WhiteFang_Jr I have a question: chat_engine.stream_chat() returns source_nodes, and I am seeing many nodes returned. How can I choose the nodes most relevant to the answer? I see that many of the returned nodes are not related to the data in the answer.
And sometimes there are answers for which, according to the data, no node actually matches.
I'm a bit confused here, to be honest. If possible, could you describe the problem you are trying to solve, why you need to create a retriever per document, and whether you got a chance to try observability?
The reason is that I have about 100 documents, currently divided into about 12 topics. I need the chatbot, when the user asks a question, to classify it into the matching topic and then select a few retrievers to answer the user's question. Before, I tried putting all 100 documents into one index, but it took too long, so I split it up like this.
I have not used the observability tool yet.
I feel you should try putting them all under one index once more and then try again. And do try the observability module; it will help us identify the part of the RAG pipeline that is taking the most time.
I haven't used observability yet, but I printed the timings and saw that fetching the vector_retriever and the docstore takes a few seconds.
Can you tell me what the problem is with using it like this, and whether there would be any problem going back to just one index?
I need your advice.
I just tried combining the small indexes into one big index by using:

all_nodes = []
for index in indices:
    nodes = index.storage_context.docstore.get_nodes(list(index.index_struct.nodes_dict.keys()))
    all_nodes.extend(nodes)

combined_index = VectorStoreIndex(nodes=all_nodes)

The result: loading combined_index takes 19s and it answers in 2s, but with the old approach it takes 14s to load and 3s to finish answering.
Getting an answer in 2 seconds is quite good.
Also, can you show how you create your chat engine and do the querying as well?
retriever = combined_index.as_retriever(similarity_top_k=6)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever,
    llm=llm,
    context_prompt=prompt_tmpl,
    verbose=False,
    chat_history=custom_chat_history,
    condense_prompt=condense_prompt,
    # node_postprocessors=[node_postprocessor],
    skip_condense=True,
    service_context=service_context
)

chat_engine.stream_chat(input_text)
Looks good, and I think 2 seconds is great.
By the way, there is no need for service_context.
service_context is there to count the tokens used. The problem is that it takes too long to load the index.
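If token counting is the only reason for keeping service_context, a hedged alternative (assuming the newer Settings-based API) is the TokenCountingHandler callback; the tokenizer encoding below is an assumption for gpt-4o-mini:

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# Count tokens globally instead of passing service_context to the chat engine.
# "o200k_base" is assumed for gpt-4o-mini; swap in the encoding your model uses.
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("o200k_base").encode
)
Settings.callback_manager = CallbackManager([token_counter])

# ... run chat_engine.stream_chat(input_text) as above, then:
print(token_counter.total_llm_token_count)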