
@kapa.ai I want to use RAG over my data. I am using OpenSearch as my vector store. I built an index and stored the content along with embeddings. Now I want to add some more documents. I used the index.insert method and observed that the new data is added to the index, but it is not added to the vector store. I tried index.refresh_ref_docs() as well. What am I missing here?
Please stand by @chaitanya as I search the knowledge base. I normally take about 20 seconds or so. :wave:
----------------------
Based on the information provided in the knowledge sources, the insert method is used to add new documents to an existing index. However, the underlying mechanism behind insertion depends on the index structure. For the vector store index, a new Document (and embeddings) is inserted into the underlying document/embedding store.

Here is an example of how to insert documents into an index:

Python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex([])
for doc in documents:
    index.insert(doc)

(Source: LlamaIndex > En > Latest > Understanding > Storing > Inserting Documents or Nodes)

However, it's important to note that to build a Vector Store Index from a list of documents in the first place, you should use the from_documents method:

Python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)

(Source: LlamaIndex > En > Latest > Understanding > Indexing > Vector Store Index > Using Vector Store Index)

If you're adding data to an existing index, some vector stores (LanceDB, in this example) let you pass the URI of your existing dataset to the from_documents method:

Python
index = VectorStoreIndex.from_documents(documents, uri="/tmp/new_dataset")

(Source: Jerryjliu > Llama_index > Blob > Main > Docs > Docs > Examples > Vector_stores > LanceDB Vector Store > Appending data)

If you're using a different vector store like OpenSearch, you might need to ensure that the new documents are properly inserted into the underlying vector store. The knowledge sources do not provide specific information on how to do this with OpenSearch. You might need to refer to the OpenSearch documentation or the specific instructions provided by the vector store you're using.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
Does anyone know the answer to this? @Logan M
Persist will save to local disk. I am trying to maintain everything in the OpenSearch vector store.
You can't use refresh without maintaining a docstore. It can't rely on a vector store alone; there needs to be a layer on top.
So I have to persist the data locally first for the initial set of files; then, after adding new files, I use the refresh method, and it updates both the vector store and the in-memory index. Is this understanding correct?
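The docstore-plus-vector-store pattern being discussed can be sketched in miniature. This is a toy illustration of the mechanism, not LlamaIndex's actual implementation: the docstore records a content hash per document id, and refresh re-embeds only documents whose hash has changed, which is exactly why a vector store alone isn't enough.

```python
import hashlib

# Toy stand-ins for the two storage layers. The vector store only holds
# embeddings; the docstore tracks what has already been ingested.
class ToyDocstore:
    def __init__(self):
        self.hashes = {}  # doc_id -> content hash

class ToyVectorStore:
    def __init__(self):
        self.embeddings = {}  # doc_id -> fake "embedding"

    def upsert(self, doc_id, text):
        self.embeddings[doc_id] = f"embedding({text})"

def refresh(docs, docstore, vector_store):
    """Insert new docs and re-embed changed ones; skip unchanged ones."""
    refreshed = []
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if docstore.hashes.get(doc_id) != h:
            docstore.hashes[doc_id] = h
            vector_store.upsert(doc_id, text)
            refreshed.append(doc_id)
    return refreshed

ds, vs = ToyDocstore(), ToyVectorStore()
refresh({"a": "first doc", "b": "second doc"}, ds, vs)  # both inserted
changed = refresh({"a": "first doc", "b": "second doc, edited"}, ds, vs)
print(changed)  # prints ['b'] -- only the changed doc was re-embedded
```

Without the docstore's hash bookkeeping, refresh would have no way to know which documents are new or changed, which matches Logan's point above.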
Is there current documentation on this? All links seem to point here: https://docs.llamaindex.ai/en/latest/how_to/index_structs/document_management.html ... but that's a 404 at this point.
and if you want to build a summary index, you switch to using the SummaryExtractor as depicted here: https://docs.llamaindex.ai/en/stable/examples/ingestion/async_ingestion_pipeline/
(A summary extractor is not required for a summary index)
Is there a way to use an IngestionPipeline in concert with a DocumentSummaryIndex? It looks like you can either use the DocumentSummaryIndex (or similar classes) or you can hook a pipeline up to a raw vector & doc store and define the transform steps, including summarization. Though the summary extractor puts the summary in metadata rather than replacing the text.
You can provide the transformations you want to run to the document summary index

DocumentSummaryIndex.from_documents(documents, transformations=[SentenceSplitter()])

In this case, any metadata extractors aren't really used/needed tbh

Point of the document summary index is to summarize a document and use that summary for retrieval, and the actual nodes are processed and stored for when that document is selected.

By contrast, a summary extractor puts the summary into metadata, which can influence the embeddings (but the original node text is also still embedded)
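The contrast described above can be illustrated with a toy sketch (plain Python, not LlamaIndex code; word overlap stands in for embedding similarity): in the document-summary-index style, the summary is the retrieval key and the original nodes are returned once a document is selected, while in the summary-extractor style the summary is just metadata attached to a node whose own text is still what gets embedded.

```python
# 1) Document-summary-index style: retrieve a document via its summary,
#    then return that document's original nodes.
summary_index = {
    "doc1": {"summary": "a guide to opensearch vector search",
             "nodes": ["node about k-NN indexes", "node about mappings"]},
    "doc2": {"summary": "a recipe blog about sourdough bread",
             "nodes": ["node about starters", "node about baking times"]},
}

def retrieve_by_summary(query):
    # Pick the document whose summary shares the most words with the query
    # (a crude stand-in for comparing query and summary embeddings).
    def overlap(doc):
        return len(set(query.split()) & set(doc["summary"].split()))
    best = max(summary_index.values(), key=overlap)
    return best["nodes"]

# 2) Summary-extractor style: the summary is metadata on the node; the
#    node's own text is still embedded and returned at retrieval time.
node = {
    "text": "node about k-NN indexes",
    "metadata": {"section_summary": "a guide to opensearch vector search"},
}

print(retrieve_by_summary("how do opensearch vector indexes work"))
# prints ['node about k-NN indexes', 'node about mappings']
```

The two designs answer different questions: the first routes retrieval through per-document summaries, the second enriches per-node embeddings with summary context.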