Hi all, I utilize vector store

Hi all, I utilize vector_store = PGVectorStore.from_params(). When doing Retrieval-Augmented Generation (RAG) for Q&A, what's the optimal way to send the full document(s) to the LLM instead of using a similarity search with similarity_top_k? How should this be implemented, and what's the most effective approach to take? Thank you
23 comments
I'm not sure what you mean. You want to fetch all data from your vector store?
I have an index for each document (one document has 150-200 pages) and I want to be able to support 2 types of queries:
  1. semantic search using embeddings
  2. fetch all data and send the entire document (150-200 pages)
So for case 2: what's the optimal way to fetch all data and send the entire document to the LLM? What's the most effective approach to use?
for use-case 2, I would use a SummaryIndex (used to be called ListIndex)

Python
from llama_index import SummaryIndex

# documents: the loaded Document objects for the 150-200 page file
index = SummaryIndex.from_documents(documents)
query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)


It will send all nodes to the LLM -- but be warned, this will be slow for something that large.
The above are the fastest settings possible for this
In case I need to query questions over multiple documents (a different index per document), would a TreeIndex be better, building indices on top of other document indices? Would it be a good approach to compose a graph made up of indices and query the graph? What would do a better job: an agent, a sub question engine, or a router engine?
I think the sub question engine would be the best pick. And then put that in an agent if you need chat history

But again -- sending 150-200 pages to the LLM will not be fast πŸ˜…
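
To make that concrete, here is a rough sketch of a sub question engine over per-document indices -- the tool names, descriptions, and doc_a_index / doc_b_index are placeholders for your own per-document indices:

Python
from llama_index.query_engine import SubQuestionQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata

# One tool per document index; names and descriptions are illustrative
query_engine_tools = [
    QueryEngineTool(
        query_engine=doc_a_index.as_query_engine(similarity_top_k=3),
        metadata=ToolMetadata(name="doc_a", description="Questions about document A"),
    ),
    QueryEngineTool(
        query_engine=doc_b_index.as_query_engine(similarity_top_k=3),
        metadata=ToolMetadata(name="doc_b", description="Questions about document B"),
    ),
]

# Decomposes a question into sub-questions, one per relevant tool
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    use_async=True,
)
response = query_engine.query("Compare how the two documents handle topic X")
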
Is there a better option for that, to be able to respond to a question that requires the entire document?
Using a vector index would be the option for that, with a similarity top k
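
For reference, use-case 1 with the existing PGVectorStore might look roughly like this -- the top-k value and query are illustrative:

Python
from llama_index import VectorStoreIndex

# vector_store is the existing PGVectorStore.from_params(...) instance
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What does the document say about X?")
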
But it will not cover all relevant nodes, because it will be limited by the configured similarity_top_k number
If I already have an index created with VectorStoreIndex.from_vector_store, how can I use a SummaryIndex? Do I need to create a new index? How do I save this new index in a database? Can a VectorStoreIndex be used both with similarity_top_k and as a SummaryIndex?
I think you need to decide whether you want all nodes or only the relevant ones πŸ˜…

You can use a router engine to switch between a summary index or a vector index as needed. Typically you only want to use the summary index for queries that require reading the entire index

You can store the summary index in mongodb, redis, s3, google cloud bucket, etc.

Two main ways -- either using fsspec or a docstore/index_store integration
https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/storage/save_load.html#using-a-remote-backend

https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/storage/docstores.html

https://gpt-index.readthedocs.io/en/stable/core_modules/data_modules/storage/index_stores.html
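
Putting those pieces together, a hedged sketch of a MongoDB-backed SummaryIndex plus a router over it and the existing vector index -- MONGO_URI, nodes, vector_index, and the tool descriptions are placeholders, not the only way to wire this up:

Python
from llama_index import StorageContext, SummaryIndex
from llama_index.query_engine import RouterQueryEngine
from llama_index.storage.docstore import MongoDocumentStore
from llama_index.storage.index_store import MongoIndexStore
from llama_index.tools import QueryEngineTool

# Persist the summary index's nodes and metadata in MongoDB (MONGO_URI is a placeholder)
storage_context = StorageContext.from_defaults(
    docstore=MongoDocumentStore.from_uri(uri=MONGO_URI),
    index_store=MongoIndexStore.from_uri(uri=MONGO_URI),
)
summary_index = SummaryIndex(nodes, storage_context=storage_context)

# vector_index is the existing VectorStoreIndex.from_vector_store(...) index
vector_tool = QueryEngineTool.from_defaults(
    query_engine=vector_index.as_query_engine(similarity_top_k=3),
    description="Useful for specific questions about parts of the document",
)
summary_tool = QueryEngineTool.from_defaults(
    query_engine=summary_index.as_query_engine(
        response_mode="tree_summarize", use_async=True
    ),
    description="Useful for questions that require reading the entire document",
)

# The router picks one tool per query based on the descriptions
query_engine = RouterQueryEngine.from_defaults(
    query_engine_tools=[vector_tool, summary_tool],
)
response = query_engine.query("Summarize the whole document")
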
To reduce the time, is there any way to distribute the load with parallel processing -- sending each chunk as a separate call to the LLM and combining the answers?

use_async=True is not working as expected
We've been avoiding parallel processing, since it generally creates hard to debug/maintain code

use_async=True should be working fine -- there is a ton of room for concurrency when waiting for API calls
use_async=True may not work for all models? I'm using Bedrock models: from langchain.llms.bedrock import Bedrock
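
(For context, a minimal sketch of how that LangChain Bedrock LLM is presumably wired in, via the LangChainLLM wrapper -- the model_id is illustrative:)

Python
from langchain.llms.bedrock import Bedrock
from llama_index import ServiceContext
from llama_index.llms import LangChainLLM

# model_id is illustrative; async support depends on the LangChain class
bedrock_llm = Bedrock(model_id="anthropic.claude-v2")
service_context = ServiceContext.from_defaults(llm=LangChainLLM(llm=bedrock_llm))
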
ah yea, langchain probably didn't implement async, classic
We should probably properly implement bedrock at some point so that we can properly support async there
do you have a timeline for when Bedrock will be implemented?
nope, I think you are the first person to ask about it from what I remember lol
most integrations are community driven
Because of course there are about 60000 LLM services lol
Bedrock is in limited preview and not yet GA; when it is, there will be more interest in using it. From my point of view, it would be good to have native support implemented so there are no such limitations.
Since you have access, if you are up for it, would love a PR. Happy to review/merge πŸ™‚

LLM integrations are fairly easy to add -- basically just a single file to add
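
For anyone picking this up, a hedged skeleton of what such a file might look like with the CustomLLM base class -- BedrockLLM, _call_bedrock, and the defaults are hypothetical, not the eventual implementation:

Python
from llama_index.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)


class BedrockLLM(CustomLLM):
    """Hypothetical sketch of a native Bedrock integration."""

    context_window: int = 8192  # assumption; depends on the chosen Bedrock model
    num_output: int = 512
    model_name: str = "anthropic.claude-v2"  # illustrative model id

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    def _call_bedrock(self, prompt: str) -> str:
        # Hypothetical helper: the real integration would call Bedrock here
        # (e.g. via boto3 invoke_model) and parse the completion from the response.
        raise NotImplementedError

    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        return CompletionResponse(text=self._call_bedrock(prompt))

    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        # Minimal non-streaming fallback: yield the full completion once
        def gen() -> CompletionResponseGen:
            yield CompletionResponse(text=self._call_bedrock(prompt))

        return gen()
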