
Sachin
When using the LlamaIndex ingestion pipeline for document management, the docstore offers an UPSERTS strategy (among others) for handling existing and new documents.
If an existing document is found, the pipeline checks its doc_hash value and performs the upsert only if the hash has changed. This way we always have the accurate documents in the docstore.
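
For reference, this is roughly how I build the input documents so that each one has a stable id for the hash comparison (the ids and text below are made up for illustration):

from llama_index.core import Document

documents = [
    # stable doc_id -> re-ingesting the same source maps to the same docstore entry
    Document(text="contents of report A", doc_id="report-a"),
    Document(text="contents of report B", doc_id="report-b"),
]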

But what happens within the vector store?
Are the existing nodes replaced with the new ones for the updated document?

In my case, the number of nodes created for a document increased after that document was updated.
The total is now a combination of the old nodes and the new nodes.

Is there an option to perform the upsert for the vector store nodes as well?
i.e. deleting all the existing nodes for the updated document and inserting the newly created ones?
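
If there isn't, this is the kind of manual workaround I have in mind; just a sketch of what I mean, not something I've confirmed is the intended approach (updated_doc and new_nodes are placeholders):

# hypothetical manual cleanup before re-inserting an updated document's nodes:
# remove every node in the vector store that references the old version of the document ...
vector_store.delete(ref_doc_id=updated_doc.doc_id)
# ... then insert the freshly created nodes (e.g. the output of pipeline.run)
vector_store.add(new_nodes)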

Here is my configuration for the ingestion pipeline, for reference:
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        parser,                    # node parser / splitter
        title_metadata_extractor,
        summary_extractor,
        qa_extractor,
        embed_model,               # embeddings last, so nodes reach the vector store embedded
    ],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
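
For completeness, this is how I invoke it (documents is the list of Document objects described above):

nodes = pipeline.run(documents=documents)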
Sachin

Hi team,
Thanks for all the wonderful work you guys have been doing.

I was wondering if someone could help me with a query I have regarding the Ingestion Pipeline and document management in LlamaIndex.

I have found that the docstore is able to remove duplicate documents when they are ingested using the Ingestion Pipeline with a vector store configured, and I have experimented with this as well.

But does the same apply to the vector store? That is, are the embeddings and other metadata stored for a duplicate document removed automatically?

For me this isn't happening, if it is indeed possible: if I ingest 2 documents using the Ingestion Pipeline, the docstore holds 2 documents, and if I re-ingest the same documents, the docstore still holds only 2 documents. The vector store, however, works in append-only mode, and the number of documents (based on nodes) in the index store keeps increasing.
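
To make the behaviour concrete, this is roughly how I'm checking it (documents here are Document objects with stable doc_ids; the variable names are placeholders):

nodes_first = pipeline.run(documents=documents)   # first ingestion: nodes are embedded and written to the vector store
nodes_again = pipeline.run(documents=documents)   # same documents, unchanged hashes

print(len(docstore.docs))   # stays at the number of source documents, as expected
# counting entries in the vector store is backend-specific, but for me that count keeps growing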

Any help/guidance is much appreciated.

Reference link - https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline/