I am using similarity_top_k=10 to query

At a glance

The post discusses the performance of querying a vector index with similarity_top_k=10, which takes a long time to produce the final answer. Community members suggest exploring ways to stream responses after completing the prompt for each selected passage, similar to ChatGPT. They discuss the possibility of manually achieving this by selecting similarity_top_k=1 ten times.

In the comments, community members provide insights and suggestions. They explain that when top_k is 10, the system has to send data to the LLM (up to) 10 times, which is a sequential process. They suggest trying response_mode="tree_summarize" with similarity_top_k=10 to parallelize the LLM calls, and response_mode="compact" to "compact" the prompt with as many retrieved texts as possible.

The community members also discuss the challenge of merging multiple vector index files into a single index, which is an important use case for parallel computing and when documents are uploaded over time. They suggest adding this as a feature request.

Finally, a community member asks a simple question about how to set top_k to 2, which is answered: pass similarity_top_k=2 to the query call.

I am using similarity_top_k=10 to query a vector index, but it takes a long time to get the final answer based on the best matches. Is there a way to stream responses after completing the prompt for each selected passage, like ChatGPT? If the query API doesn't provide that, is it possible to achieve it manually by selecting similarity_top_k=1 ten times?
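For readers skimming: the pattern being asked about looks roughly like this (a sketch against the gpt_index API of the time; the file name and question are placeholders):

```python
from gpt_index import GPTSimpleVectorIndex

# Load a previously built vector index from disk (placeholder file name).
index = GPTSimpleVectorIndex.load_from_disk("index.json")

# Retrieve the 10 most similar passages, then synthesize an answer from them.
response = index.query(
    "What does the document say about X?",
    similarity_top_k=10,
)
print(response)
```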
23 comments
When top_k is 10, it has to send data to the LLM (up to) 10 times, refining the response as it goes, so it is inherently a sequential process (at least as far as I understand it)

Streaming responses has been a hot request lately; I wouldn't doubt it's being worked on soon
@Logan M interesting, I actually thought it sends all retrieved paragraphs at once. Can you tell me what the actual default LLM query looks like, for the first paragraph and then for the following ones?
Hopefully @jerryjliu0 can shed some light on this
See the text QA prompt and refine prompt here (lines 88-111) for examples of the queries the vector index uses:

https://github.com/jerryjliu/gpt_index/blob/main/gpt_index/prompts/default_prompts.py
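For reference, those two templates look roughly like this (paraphrased from memory; the linked file is authoritative):

```python
# Paraphrase of gpt_index/prompts/default_prompts.py; see the link above
# for the exact wording.

# First retrieved chunk: answer the question from the context alone.
TEXT_QA_PROMPT_TMPL = (
    "Context information is below. \n"
    "---------------------\n"
    "{context_str}"
    "\n---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query_str}\n"
)

# Chunks 2..k: refine the previous answer with each new chunk in turn,
# which is why the default mode is inherently sequential.
REFINE_PROMPT_TMPL = (
    "The original question is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer "
    "(only if needed) with some more context below.\n"
    "------------\n"
    "{context_msg}\n"
    "------------\n"
    "Given the new context, refine the original answer to better "
    "answer the question. "
    "If the context isn't useful, return the original answer.\n"
)
```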
@erajasekar we now have async support, try setting response_mode="tree_summarize" when you have similarity_top_k=10! This allows us to parallelize LLM calls
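In code, that suggestion is a one-argument change (a sketch; the parallel LLM calls happen inside the library):

```python
response = index.query(
    "What does the document say about X?",
    similarity_top_k=10,
    response_mode="tree_summarize",  # summarize retrieved chunks as a tree; LLM calls can run concurrently
)
```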
@jerryjliu0 @Logan M Is there a way to retrieve the top_k paragraphs using embeddings and send all of them to the LLM at once to generate an answer? It may lose context when they're sent one by one.
yeah try response_mode="compact"
that will "compact" the prompt with as many retrieved texts as possible
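Likewise for compact mode (a sketch):

```python
response = index.query(
    "What does the document say about X?",
    similarity_top_k=10,
    response_mode="compact",  # pack as many retrieved texts as fit into each prompt
)
```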
Thanks @jerryjliu0! How will they be written in the prompt in this case?
Regardless, I will give it a try. My current problem is that I have an index file (JSON) per original document, and I couldn't find a way to merge them into one index (unless using composability, which creates another layer)
@yoelk are you trying to combine everything into one vector index?
with response_mode="compact", each paragraph will be "stuffed" into the prompt up until the prompt is full
@jerryjliu0 yes, as the indices were created in a parallelized way
i see, that's interesting. We don't have full support for "converting" indices to other indices atm but that's something we can look into
@jerryjliu0 what do you mean by converting? I'm not trying to convert them, just want to have one index.json (SimpleVectorIndex) instead of multiple SimpleVectorIndex files
yeah that's what i meant - i was thinking of that as "conversion"
Got it. I think this is an important use case, not just for parallel computing but also when document files are uploaded over time, so that a new index is created at upload
@jerryjliu0 I'll add that in feature requests
out of curiosity why are you creating a new index for each document vs. adding the document to an existing index?
@jerryjliu0 whenever document files are uploaded by the user, a new instance of an AWS Lambda calculates the index and saves it
@jerryjliu0 So per input text file I create an index file in an async way
@jerryjliu0 Happy to hear your thoughts if you have another way of implementing it in a parallelized way while still having one index file at the end of the process
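Pending first-class support, one possible workaround is to load each per-document index, pull the stored chunk text back out, and re-insert it into a single index. This is only a sketch: the docstore attribute names are from the gpt_index version of the time and may differ, and re-inserting re-embeds every chunk, so it costs extra embedding calls:

```python
from gpt_index import Document, GPTSimpleVectorIndex

merged = GPTSimpleVectorIndex([])  # start from an empty vector index

for path in ["index_0.json", "index_1.json"]:  # placeholder per-document index files
    part = GPTSimpleVectorIndex.load_from_disk(path)
    # docstore.docs maps node ids to stored nodes (version-dependent attribute).
    for node in part.docstore.docs.values():
        merged.insert(Document(node.get_text()))  # re-embeds the chunk text

merged.save_to_disk("merged_index.json")
```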
Hi everyone. Is there a simple way to tell the query method on a GPTSimpleVectorIndex to include top_k=2? It looks like it's set to top_k=1 by default. I couldn't figure this out from the documentation, so apologies if it's a silly question. Cheers
Try adding similarity_top_k=2 to the query call πŸ‘
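That is, something like this (the default is similarity_top_k=1):

```python
response = index.query("your question here", similarity_top_k=2)
```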