Improving Response Time for SummaryIndex Using SentenceSplitter
At a glance
The community member is experiencing slow response times when querying a summary index built with a sentence splitter for the nodes, and is asking whether the process can be sped up. The replies explain that a summary index sends everything to the language model, whereas a vector index only sends the top-k results, which affects summary quality. The community members discuss adding a routing layer (routers, selectors, or agents) to decide between the summary index and the vector index, though this adds latency from an extra LLM call. The original poster already tried a RouterQueryEngine with a vector store index and a summary index, but is still looking for ways to lower latency: the summary index takes around 2 minutes to generate a complete summary, compared to about 8 seconds for ChatGPT on the same document.
I have been trying to use a SummaryIndex with a SentenceSplitter for the nodes, but the response time when querying the index is slow. I was wondering if there are any ways to speed up the process.
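For reference, a minimal sketch of that setup using the llama_index.core API; the data path and chunking parameters are illustrative:

```python
from llama_index.core import SimpleDirectoryReader, SummaryIndex
from llama_index.core.node_parser import SentenceSplitter

# Load documents (path is illustrative)
documents = SimpleDirectoryReader("./data").load_data()

# Split documents into nodes with SentenceSplitter (chunk sizes are illustrative)
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
nodes = splitter.get_nodes_from_documents(documents)

# Build a summary index over the nodes and query it;
# at query time, every node is sent to the LLM
index = SummaryIndex(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the document.")
print(response)
```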
Do you need a summary index? The summary index will send EVERYTHING in the index to the LLM, whereas something like a vector index only sends the top-k
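For contrast, a vector index built over the same nodes only retrieves the top-k most similar chunks per query; a sketch, with similarity_top_k and the query chosen for illustration:

```python
from llama_index.core import VectorStoreIndex

# Same nodes as above; only the top-k most similar chunks reach the LLM
vector_index = VectorStoreIndex(nodes)
vector_query_engine = vector_index.as_query_engine(similarity_top_k=3)
response = vector_query_engine.query("What does the document say about <topic>?")
```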
In some ways, yes. The thing is, when I use a vector index and ask the chatbot to summarize the full document, the response quality of the LLM is bad, whereas when I use the SummaryIndex the LLM produces a correct summary of the given document.
Now, the problem is that I don't know of any other alternative, since a user can ask it anything.
You need some kind of layer on top to route between the summary index and the vector index.
We have routers, selectors, and agents, all of which can help solve that. It adds some latency (it's an extra LLM call), but it will help with accuracy.
Yeah, I did that too. I used the RouterQueryEngine with a VectorStoreIndex and a SummaryIndex to route user queries. But my question is: are there any ways to lower the latency? For the summary index, even when I use a streaming response, it takes roughly 2 minutes on average to generate the complete summary, whereas ChatGPT takes around 8 seconds to summarize the same document.
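A sketch of that routing setup with RouterQueryEngine, assuming the nodes from earlier; the tool descriptions are illustrative. On the summary path, `response_mode="tree_summarize"` with `use_async=True` is one knob that may reduce wall-clock time by running the per-chunk LLM calls concurrently, though it still won't match a single-call summary:

```python
from llama_index.core import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

summary_index = SummaryIndex(nodes)
vector_index = VectorStoreIndex(nodes)

summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(
        response_mode="tree_summarize",
        use_async=True,  # parallelize per-chunk summarization calls
    ),
    description="Useful for questions that need a summary of the entire document.",
)
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=3),
    description="Useful for questions about specific parts of the document.",
)

# An LLM-based selector picks which index to send each query to
router_engine = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[summary_tool, vector_tool],
)
response = router_engine.query("Give me a complete summary of this document.")
```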
Not really. You need some way to decide between "Do I need a summary of all my files, or do I need a specific part of my files?"
The other option is maybe an embedding/similarity-based approach to routing, but for that you need "examples" of the types of queries that belong to each option you want to route between. This is often tricky to do and hard to maintain.
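A rough illustration of that idea, independent of any particular framework: `embed()` is a hypothetical stand-in for whatever embedding model is already in use (e.g. the one backing the vector index), and the example queries are made up. Routing this way avoids the extra LLM call, at the cost of maintaining the example sets:

```python
import numpy as np

# Example queries per route (illustrative; these need curation and upkeep)
ROUTE_EXAMPLES = {
    "summary": [
        "Summarize the whole document",
        "Give me an overview of this file",
    ],
    "vector": [
        "What does section 3 say about pricing?",
        "Which names are mentioned in the introduction?",
    ],
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query: str, embed) -> str:
    """Pick the route whose example queries are most similar to the user query."""
    query_vec = embed(query)
    scores = {
        name: max(cosine(query_vec, embed(example)) for example in examples)
        for name, examples in ROUTE_EXAMPLES.items()
    }
    return max(scores, key=scores.get)
```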