Updated 10 months ago

At a glance

The community member is using OpenSearch vector DB for their vector store and has built an index, storing content along with embeddings. They want to add more documents to the index, but found that while the new data is added to the index, it is not added to the vector store. They tried using the index.refresh_ref_docs() method but it did not resolve the issue.

The comments provide some guidance, suggesting that for a vector store index, the from_documents method should be used to build the index from a list of documents, rather than using index.insert. It also notes that when adding data to an existing index, the from_documents method can be used with the URI of the existing index.

However, the comments also indicate that when using a different vector store like OpenSearch, the community member may need to ensure the new documents are properly inserted into the underlying vector store, and that the documentation for OpenSearch or the specific vector store being used should be consulted.

The comments suggest that the community member should persist the data locally first for the initial set of files, and then use the refresh method to update the vector store and index. There is also discussion around using an IngestionPipeline in concert with a DocumentSummaryIndex, and how transformations can be passed to the index directly.

@kapa.ai I want to use RAG over my data. I am using OpenSearch vector DB for my vector store. I built an index and stored the content along with embeddings. Now I want to add some more documents to this. I used the index.insert method and observed that the new data is stored in the index, but it is not added to the vector store. I tried index.refresh_ref_docs() as well. What am I missing here?
14 comments
Please stand by @chaitanya as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
Based on the information provided in the knowledge sources, the insert method is used to add new documents to an existing index. However, the underlying mechanism behind insertion depends on the index structure. For the vector store index, a new Document (and embeddings) is inserted into the underlying document/embedding store.

Here is an example of how to insert documents into an index:

Python
from llama_index.core import VectorStoreIndex

# `documents` is a list of Document objects loaded elsewhere
index = VectorStoreIndex([])
for doc in documents:
    index.insert(doc)

(Source: LlamaIndex > En > Latest > Understanding > Storing > Storing > Inserting Documents or Nodes)

However, it's important to note that if you're using a Vector Store Index, you should use the from_documents method to build the index from a list of documents:

Python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

(Source: LlamaIndex > En > Latest > Understanding > Indexing > Indexing > Vector Store Index > Using Vector Store Index)

If you're adding data to an existing index, you can use the from_documents method and pass the URI of your existing index:

Python
index = VectorStoreIndex.from_documents(documents, uri="/tmp/new_dataset")

(Source: Jerryjliu > Llama_index > Blob > Main > Docs > Docs > Examples > Vector_stores > LanceDB Vector Store > Appending data)

If you're using a different vector store like OpenSearch, you might need to ensure that the new documents are properly inserted into the underlying vector store. The knowledge sources do not provide specific information on how to do this with OpenSearch. You might need to refer to the OpenSearch documentation or the specific instructions provided by the vector store you're using.
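For reference, a typical way to wire an OpenSearch-backed index in LlamaIndex is sketched below, so that index.insert writes embeddings into OpenSearch itself. This is a minimal sketch, not part of the original answer: it assumes the llama-index-vector-stores-opensearch package is installed and a running OpenSearch cluster; the endpoint, index name, and dimension below are placeholders to adjust for your setup.

```python
# Sketch: an OpenSearch-backed VectorStoreIndex. Requires a running
# OpenSearch cluster; endpoint/index/dim below are placeholders.
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.opensearch import (
    OpensearchVectorClient,
    OpensearchVectorStore,
)

# Placeholder connection details -- adjust for your cluster.
client = OpensearchVectorClient(
    endpoint="http://localhost:9200",
    index="my-rag-index",
    dim=1536,  # must match your embedding model's dimension
)
vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = [Document(text="Initial content.")]
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Because the index is backed by the OpenSearch vector store,
# insert() pushes the new node and its embedding into OpenSearch.
index.insert(Document(text="Newly added content."))
```

With this wiring, new nodes go into OpenSearch directly rather than into a default in-memory store.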
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
Does anyone know the answer to this? @Logan M
Persist will save to local disk. I am trying to maintain everything in the OpenSearch vector store.
You can't use refresh without maintaining a docstore. It can't rely only on a vector store; there needs to be a layer on top
So, I have to persist the data locally first for the initial set of files, then after adding new files, I have to use the refresh method, which will update the vector store and the in-memory index. Is this understanding correct?
Is there current documentation on this? All links seem to point here: https://docs.llamaindex.ai/en/latest/how_to/index_structs/document_management.html ... but that's a 404 at this point.
and if you want to build a summary index, you switch to using the SummaryExtractor as depicted here: https://docs.llamaindex.ai/en/stable/examples/ingestion/async_ingestion_pipeline/
(A summary extractor is not required for a summary index)
Is there a way to use an IngestionPipeline in concert with a DocumentSummaryIndex? It looks like you can either use the DocumentSummaryIndex (or similar classes) or you can hook a pipeline up to a raw vector & doc store and define the transform steps, including summarization. Though the summary extractor puts the summary in metadata rather than replacing the text.
You can provide the transformations you want to run to the document summary index

DocumentSummaryIndex.from_documents(documents, transformations=[SentenceSplitter()])

In this case, any metadata extractors aren't really used/needed tbh

The point of the document summary index is to summarize a document and use that summary for retrieval, and the actual nodes are processed and stored for when that document is selected.

By contrast, a summary extractor puts the summary into metadata, which can influence the embeddings (but the original node text is also still embedded)