Optimizing Response Time for CondensePlusContextChatEngine

Hi @all.
I am using CondensePlusContextChatEngine for chatting, but I am facing an issue where the response time is too long, averaging over 12 seconds to return an answer. How can I optimize this response time?
CondensePlusContextChatEngine produces a response in two steps:
  • It condenses a standalone question from the user query and the previous chat history.
  • It fetches nodes and then generates an answer based on the fetched nodes and the condensed query.
If you are using open-source LLMs, this may take a while, since each of these two steps makes an LLM call.
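Roughly, both calls come out of the same engine; a minimal sketch (retriever and llm here are placeholders for your own objects), showing where skip_condense removes the first call:

from llama_index.core.chat_engine import CondensePlusContextChatEngine

# `retriever` and `llm` are placeholders
chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever,
    llm=llm,
    # skip_condense=True,  # uncomment to drop LLM call 1 (question condensing)
)
# LLM call 1: condense the question from the chat history,
# then retrieve nodes for it,
# LLM call 2: generate the answer from the retrieved nodes
response = chat_engine.chat("your question")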
Yes, but I find it a bit long. How can we improve the speed while keeping the answer the same? I have researched this, and one solution is to reduce the number of nodes, but sometimes reducing nodes does not give the desired results.
The main bottleneck is the LLM calls. There is no way to speed that up except using either a faster LLM or prompting the LLM to write shorter responses.

OpenAI, Anthropic, etc. run very fast.

Local LLMs will depend on your hardware.
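As a rough sketch of both levers (a fast hosted model plus shorter outputs), assuming the OpenAI integration and placeholder values:

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",                                    # fast hosted model
    max_tokens=256,                                         # cap completion length
    system_prompt="Answer concisely, in a few sentences.",  # nudge toward shorter responses
)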
I'm currently using gpt-4o-mini, but it sometimes takes nearly 10 seconds for stream_chat to respond, maybe because my documents are heavy. Is there any other way to optimize it?
@Logan M @WhiteFang_Jr

I tried FaissVectorStore and saw a speed improvement, but when I use multiple indexes via QueryFusionRetriever and CondensePlusContextChatEngine, I often get errors like '2', '6', '141' when calling chat_engine.stream_chat(input_text). I don't understand where I went wrong. When I don't use FaissVectorStore, I still create several indexes and everything works.

doc: https://docs.llamaindex.ai/en/stable/examples/vector_stores/FaissIndexDemo/
Maybe something went haywire during indexing, I guess. Could you try creating a new index and then try again?
I would suggest you use one of the observability tools; that way we can identify exactly the root cause of this delay.

Also, if you have a large number of records (like, really large), keeping them in memory will make the process a bit slow unless you have a GPU.

You can try Qdrant; I feel it's much better than Faiss.
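For observability, one low-effort option is LlamaIndex's built-in debug handler, which records per-event timings; a rough sketch using the global Settings:

from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

llama_debug = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([llama_debug])

# ... build the index / chat engine and run a query as usual, then:
print(llama_debug.get_event_time_info(CBEventType.LLM))       # time spent in LLM calls
print(llama_debug.get_event_time_info(CBEventType.RETRIEVE))  # time spent retrieving nodes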
I tried it and got the same result. I can't figure out what the cause is yet.

Create Index:

import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

# faiss_index is created beforehand, e.g. faiss.IndexFlatL2(d) where d matches
# the embedding model's dimension (see the FaissIndexDemo doc linked above)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# service_context = ServiceContext.from_defaults(llm=llm1, chunk_size=512, chunk_overlap=50)

docs = []
doc = SimpleDirectoryReader(
    input_files=[f"{Constants.docs_path}/{directory_path}/{name_file}"]
).load_data()

doc[0].doc_id = name_file
docs.extend(doc)

# create the index
index = VectorStoreIndex.from_documents(
    doc,
    # service_context=service_context,
    storage_context=storage_context,
)

# directory in which the index will be persisted
index.storage_context.persist(persist_dir=f"{Constants.index_path}/{directory_path}")
I am not using a GPU at the moment.
Is there any fee to use Qdrant currently?
Yes, Qdrant Cloud gives 1 GB of space free for a lifetime.
Try with Qdrant once and add an observability tool to get details on the time taken by each step in between. That will help identify the step which is causing the delay.
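If you do try Qdrant, the swap is mostly just the vector store object; a rough sketch with placeholder connection details and collection name:

import qdrant_client
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

# placeholder URL/key for Qdrant Cloud (or point at a local instance)
client = qdrant_client.QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_KEY")
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)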
Thanks, I will try. But can you check what is wrong with my error above? I have sent you my code, and below is how I load it:

pull_path = f"{directory_path}/{folder}"
vector_store = FaissVectorStore.from_persist_dir(f"{pull_path}")
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    persist_dir=pull_path,
)
index = load_index_from_storage(storage_context)
vector_retriever = index.as_retriever(similarity_top_k=6)
return vector_retriever, index.docstore.docs.values()

///QueryFusionRetriever////
retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=result,  # Reduce the number of generated queries
    # mode="simple",
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)
Are you creating a retriever for each document?
That's right
Because I think we should also choose some suitable retrievers for the question first and then query.
Have you tried creating a single retriever for all the documents and checked the accuracy with that?
I have a use case in my org with around 1K+ documents behind a single retriever, and it works quite well.
I tried merging all indexes but the query is still slower than splitting it into multiple indexes.
I think using a single retriever for all of them would be similar. I am having the same problem now as I asked about at the beginning: only 5 documents and it is already this slow.
please check with the observability tool once
With a single retriever, this part is removed, right?

///QueryFusionRetriever////
retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=result,  # Reduce the number of generated queries
    # mode="simple",
    mode="reciprocal_rerank",
    use_async=True,
    verbose=True,
)
I don't understand what you mean.
Currently I am using QueryFusionRetriever
Okay, and with that I'm assuming the value in num_queries=result (# Reduce the number of generated queries) is never 0, right?

What happens here is that the LLM generates the number of extra queries specified there. Try setting it to 0.
I think this will reduce the time for you.
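For reference, the QueryFusionRetriever examples in the LlamaIndex docs use num_queries=1 (just the original query) to disable the extra query generation entirely; a minimal sketch:

retriever = QueryFusionRetriever(
    vector_retrievers,
    similarity_top_k=6,
    num_queries=1,  # 1 = only the original query, no LLM-generated variants
    mode="reciprocal_rerank",
    use_async=True,
)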
ok i will try
It reduces response time but only partially.
@WhiteFang_Jr I want to ask you another question: when I load the index with load_index_from_storage, is there a way to cache it so that it loads faster next time?
Yes, you would need to use a vector DB, like Qdrant.
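With a hosted vector store like Qdrant, the index can be wrapped around the remote collection directly instead of being reloaded from disk each time; a rough sketch (assuming the collection already holds the embeddings, and the names are placeholders):

import qdrant_client
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_KEY")
vector_store = QdrantVectorStore(client=client, collection_name="my_docs")

# no load_index_from_storage needed; the index points at the remote collection
index = VectorStoreIndex.from_vector_store(vector_store)
retriever = index.as_retriever(similarity_top_k=6)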
Have you used an observability tool? Try with that once.
I have never used it. Let me find out.
@WhiteFang_Jr I have a question: after using chat_engine.stream_chat(), which returns source_nodes, I see many nodes returned. How can I choose the node most suitable for the answer? I see many nodes that are not related to the data.
And sometimes there are answers for which, according to the data, no node matches.
I'm a bit confused here, to be honest. If possible, could you describe the problem statement you are trying to work on, and what the need is to create a retriever per document? And did you get a chance to try observability too?
The reason is that I have about 100 documents, currently divided into about 12 topics. I need to solve the problem where, when a user asks the chatbot something, it classifies the question into the matching topic, and then I select a few retrievers to answer the question. Before, I tried to put all 100 documents into one index, but it took too long, so I split it up like this.
I have not used the Observability tool yet.
I feel you should try putting them all under one index and then try one more time. And do try the observability module; it will help us identify the part of the RAG pipeline that is taking the most time.
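A rough sketch of the single-index variant (the directory path is a placeholder):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# load every document from the docs folder into one index
documents = SimpleDirectoryReader(input_dir="docs/", recursive=True).load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=6)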
I haven't used observability yet. But I printed the timings and saw that getting the vector_retriever and docstore takes a few seconds.
Can you tell me what the problem is with using it like this, and whether there would be any problem going back to just one index?
I need your advice.
I just tried combining the small indexes into one big index by using:

all_nodes = []
for index in indices:
    nodes = index.storage_context.docstore.get_nodes(list(index.index_struct.nodes_dict.keys()))
    all_nodes.extend(nodes)

combined_index = VectorStoreIndex(nodes=all_nodes)

The result: loading combined_index takes 19s and it answers in 2s, but with the old approach it's 14s to load and 3s to finish answering.
Getting an answer in 2 seconds is quite good.
Also, can you show how you create your chat engine and how you query it as well?
retriever = combined_index.as_retriever(similarity_top_k=6)

chat_engine = CondensePlusContextChatEngine.from_defaults(
    retriever,
    llm=llm,
    context_prompt=prompt_tmpl,
    verbose=False,
    chat_history=custom_chat_history,
    condense_prompt=condense_prompt,
    # node_postprocessors=[node_postprocessor],
    skip_condense=True,
    service_context=service_context,
)

chat_engine.stream_chat(input_text)
looks good, and I think 2 sec time is great
btw no need for service_context
service_context is used to count the tokens used. The problem is that it takes too long to load the index.
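If service_context is only there for token counting, a TokenCountingHandler on the global Settings is one way to keep the counts without it; a rough sketch (the tokenizer choice is an assumption for gpt-4o-mini):

import tiktoken
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

token_counter = TokenCountingHandler(
    tokenizer=tiktoken.get_encoding("o200k_base").encode  # assumed encoding for gpt-4o-mini
)
Settings.callback_manager = CallbackManager([token_counter])

# ... after chat_engine.stream_chat(...) has run:
print(token_counter.total_llm_token_count)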