----------------------
To avoid duplication when loading nodes into a vector store and to refresh existing nodes, LlamaIndex provides several mechanisms. Here's how you can handle this:
- Using the Ingestion Pipeline:
The IngestionPipeline in LlamaIndex offers built-in deduplication and refresh capabilities. You can set up the pipeline with a docstore and a vector store, and use the DocstoreStrategy.UPSERTS strategy to handle updates and avoid duplicates.
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.vector_stores import SimpleVectorStore

pipeline = IngestionPipeline(
    transformations=[...],  # e.g. a text splitter followed by an embedding model
    docstore=SimpleDocumentStore(),  # tracks document IDs and hashes between runs
    vector_store=SimpleVectorStore(...),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
This setup handles deduplication and updates automatically each time you run the pipeline: documents are matched by ID, unchanged ones are skipped, and new or changed ones are processed.
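To make the upsert idea concrete, here is a minimal sketch in plain Python of deduplication by document ID plus content hash. It is a toy model and not LlamaIndex's implementation: `ToyDocstore`, `upsert`, and the returned labels are hypothetical names chosen for illustration.

```python
import hashlib


class ToyDocstore:
    """Toy stand-in for a document store: maps doc ID -> content hash."""

    def __init__(self):
        self._hashes = {}

    def upsert(self, doc_id: str, text: str) -> str:
        """Return 'inserted', 'updated', or 'skipped' for one document."""
        new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old_hash = self._hashes.get(doc_id)
        if old_hash == new_hash:
            return "skipped"  # same ID and same content: a duplicate
        self._hashes[doc_id] = new_hash
        return "inserted" if old_hash is None else "updated"


store = ToyDocstore()
first_run = [store.upsert("a", "hello"), store.upsert("b", "world")]
second_run = [store.upsert("a", "hello"), store.upsert("b", "world, edited")]
print(first_run)
print(second_run)
```

On the second run, only the document whose text changed is reprocessed. This only works if document IDs stay the same between runs.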
- Using the refresh_ref_docs() method:
If you're directly managing your index, you can use the refresh_ref_docs() method to update existing documents and add new ones:
refreshed_docs = index.refresh_ref_docs(doc_chunks)
This method updates documents that have the same document id_ but different content, and inserts documents not already present in the index.
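The call returns a list of booleans, one per input document in the same order, marking which documents were inserted or updated. The sketch below shows one way to read that list. The literal values and the `doc_ids` names are made-up stand-ins so the snippet runs without an index.

```python
# Hypothetical IDs of the documents passed to refresh_ref_docs(), in order
doc_ids = ["doc_0", "doc_1", "doc_2"]

# Stand-in for the value returned by index.refresh_ref_docs(doc_chunks)
refreshed_docs = [True, False, True]

# Pair each flag with its document to see what was actually touched
changed = [doc_id for doc_id, flag in zip(doc_ids, refreshed_docs) if flag]
unchanged = [doc_id for doc_id, flag in zip(doc_ids, refreshed_docs) if not flag]

print("inserted or updated:", changed)
print("left as-is:", unchanged)
```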
- Manual management with update_ref_doc():
For more granular control, you can use the update_ref_doc() method to update specific documents:
doc_chunks[0].text = "Brand new document text"
index.update_ref_doc(doc_chunks[0])
This approach allows you to update individual documents as needed.
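Conceptually, updating a document amounts to removing the nodes derived from its old version and inserting nodes built from the new one. The sketch below illustrates that delete-then-insert pattern in plain Python. `ToyIndex` and its fixed-size chunking are hypothetical simplifications, not the library's code.

```python
from collections import defaultdict


class ToyIndex:
    """Toy index: stores text chunks grouped by the ID of their source document."""

    def __init__(self, chunk_size: int = 10):
        self.chunk_size = chunk_size
        self.nodes = defaultdict(list)  # source doc ID -> list of chunks

    def insert(self, doc_id: str, text: str) -> None:
        step = self.chunk_size
        chunks = [text[i:i + step] for i in range(0, len(text), step)]
        self.nodes[doc_id].extend(chunks)

    def delete_ref_doc(self, doc_id: str) -> None:
        self.nodes.pop(doc_id, None)

    def update_ref_doc(self, doc_id: str, text: str) -> None:
        # Drop every chunk of the old version before inserting the new one,
        # so no stale chunks linger next to the fresh ones.
        self.delete_ref_doc(doc_id)
        self.insert(doc_id, text)


toy_index = ToyIndex(chunk_size=10)
toy_index.insert("doc_0", "Original document text")  # stored as three chunks
toy_index.update_ref_doc("doc_0", "Brand new")
print(toy_index.nodes["doc_0"])
```

Deleting first matters because the new text may split into fewer chunks than the old one. Overwriting chunk by chunk would leave orphaned pieces of the previous version behind.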