----------------------
LlamaIndex provides a mechanism to detect and handle duplicate documents during ingestion. This is achieved by attaching a `docstore` to the ingestion pipeline. The pipeline uses the `document.doc_id` or `node.ref_doc_id` as a grounding point and actively looks for duplicate documents.
Here's how it works:
- It stores a map of `doc_id` -> `document_hash`.
- If a duplicate `doc_id` is detected and the hash has changed, the document is re-processed and upserted.
- If a duplicate `doc_id` is detected and the hash is unchanged, the node is skipped.
- If a vector store is not attached, it checks all existing hashes for each node. If a duplicate is found, the node is skipped; otherwise, the node is processed.
Please note that if a vector store is not attached, the system can only check for and remove duplicate inputs.
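To make the bookkeeping above concrete, here is a minimal plain-Python sketch of the `doc_id` -> `document_hash` upsert/skip decision. This is illustrative only, not the library's actual implementation; the `upsert` function and the use of SHA-256 are assumptions for the sketch.

```python
import hashlib


def upsert(docstore: dict, doc_id: str, text: str) -> str:
    """Illustrative sketch of the doc_id -> document_hash logic described above.

    Returns the action the pipeline would take for this document.
    (Hypothetical helper, not a LlamaIndex API.)
    """
    # Hash the document content (SHA-256 chosen for the sketch).
    doc_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()

    if doc_id not in docstore:
        # New doc_id: store its hash and process the document.
        docstore[doc_id] = doc_hash
        return "processed"
    if docstore[doc_id] != doc_hash:
        # Same doc_id, changed content: re-process and upsert the new hash.
        docstore[doc_id] = doc_hash
        return "re-processed"
    # Same doc_id, same hash: nothing to do.
    return "skipped"


store = {}
print(upsert(store, "doc1", "hello"))        # processed
print(upsert(store, "doc1", "hello"))        # skipped
print(upsert(store, "doc1", "hello world"))  # re-processed
```

The map lives in the attached docstore, which is why duplicate detection only works when a docstore is provided to the pipeline.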
Here is a sample code snippet showing how to attach a `docstore` to the ingestion pipeline:

```python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[...],  # your transformation steps
    docstore=SimpleDocumentStore(),
)
```
For more details, you can refer to the LlamaIndex documentation and the example notebook.
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------