MMR
MMR

Carrying on from this discussion: https://github.com/run-llama/llama_index/issues/10682#issuecomment-2087996422. I fundamentally have an issue where my retriever is pulling a ton of duplicate data from the database. I was thinking that MMR could help with this. What do you think I should do? Should I spend time checking that data going into the vector store isn't duplicated multiple times, or should I implement MMR? Don't others have this issue? If you continually ingest data, do you really have time to check for duplicates during indexing?
9 comments
@kapa.ai when ingesting documents, how do I detect whether a node has duplicate content from previously ingested nodes, and somehow eliminate it?
Let me check the knowledge base for you. One moment! :books:
----------------------
LlamaIndex provides a mechanism to detect and handle duplicate documents during the ingestion process. This is achieved by attaching a docstore to the ingestion pipeline. The ingestion pipeline uses the document.doc_id or node.ref_doc_id as a grounding point and actively looks for duplicate documents.

Here's how it works:

  • It stores a map of doc_id -> document_hash.
  • If a duplicate doc_id is detected, and the hash has changed, the document will be re-processed and upserted.
  • If a duplicate doc_id is detected and the hash is unchanged, the node is skipped.
  • If a vector store is not attached, it checks all existing document hashes for each node. If a duplicate is found, the node is skipped; otherwise, the node is processed.
Please note that if a vector store is not attached, the system can only check for and remove duplicate inputs.
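The skip/upsert rules above can be sketched with a plain `doc_id -> hash` map. This is an illustrative stand-in for the docstore's bookkeeping, not the actual LlamaIndex implementation; the function name and return values are made up for the example.

```python
import hashlib


def ingest_decision(registry: dict, doc_id: str, text: str) -> str:
    """Decide what to do with an incoming document, given a
    doc_id -> document_hash map (``registry``).

    Returns 'insert', 'upsert', or 'skip' -- mirroring the rules:
    new id -> insert; same id, changed hash -> upsert;
    same id, same hash -> skip.
    """
    h = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if doc_id not in registry:
        registry[doc_id] = h
        return "insert"
    if registry[doc_id] != h:
        registry[doc_id] = h  # hash changed: re-process and upsert
        return "upsert"
    return "skip"  # duplicate id with unchanged hash


registry = {}
print(ingest_decision(registry, "doc-1", "hello"))   # insert
print(ingest_decision(registry, "doc-1", "hello"))   # skip
print(ingest_decision(registry, "doc-1", "hello!"))  # upsert
```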

Here is a sample code snippet on how to attach a docstore to the ingestion pipeline:

Python
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[...], docstore=SimpleDocumentStore()
)


For more details, you can refer to the LlamaIndex documentation () and the example notebook ().
----------------------
Note: you can ask me a follow up question by @ mentioning me again :speech_balloon:
----------------------
If you can ensure that the source has a consistent document ID, it's pretty straightforward to check for duplicate or modified documents
an ingestion pipeline with a docstore and vector store attached will record ingested document IDs and their hashes

If a document with the same ID is ingested again, the hash is used to determine if it should be skipped or upserted
@Logan M i am using a different strategy. What I am doing is taking snippets of text and loading them individually in as Documents. This means I could have a snippet that is exactly the same as some other source document. Is there a way to check for the similarity as it is ingested somehow?
you could set the document ID as the hash of the text (note that this wouldn't catch cases where the same "document" was updated with new text)
no that's ok, I am just doing a complete reset and building the index from scratch
i do like the hash idea, I think that could work
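A minimal sketch of the hash idea, assuming you want identical snippets to collapse to one entry before building `Document`s (the `content_id` helper and the snippet list are made up for the example; the `Document(text=..., doc_id=...)` construction in the trailing comment is the intended LlamaIndex usage):

```python
import hashlib


def content_id(text: str) -> str:
    """Derive a deterministic ID from the snippet text, so exact
    duplicate snippets map to the same ID and get deduplicated."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


snippets = ["same snippet", "other snippet", "same snippet"]

# Keep only the first occurrence of each content hash.
unique = {}
for s in snippets:
    unique.setdefault(content_id(s), s)

print(len(unique))  # 2 -- the exact duplicate collapses to one entry

# With LlamaIndex, you would then build the documents roughly like:
#   from llama_index.core import Document
#   docs = [Document(text=s, doc_id=content_id(s)) for s in unique.values()]
```

As noted above, this only catches byte-identical text; a snippet whose wording changed slightly gets a new hash and is treated as a new document.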