At a glance

The community member is exploring the Ingestion Pipeline and Document Management features of LlamaIndex. They have observed that the docstore removes duplicate documents, but they are unsure whether the same applies to the vector store. The community member is using MongoDBAtlasVectorSearch as the vector store and has provided the code snippet for their Ingestion Pipeline setup.

In the comments, another community member reports that the vector store works fine for them, but the original poster still sees the number of documents (based on nodes) in the index store increase even when re-ingesting the same documents. Additionally, when using the UPSERTS doc strategy, the community member encounters an error during the parsing step because there are no new documents left to process.

The community member is seeking guidance and input from the community to resolve these issues.

Hi team,
Thanks for all the wonderful work you guys have been doing.

I was wondering if someone could help me with a query I had regarding the Ingestion Pipeline and Document Management using LlamaIndex.

I have found that the docstore is able to remove duplicate documents when they are ingested using the Ingestion Pipeline with a vector store configured, and I have experimented with this as well.

Does the same apply to the vector store, though? That is, are the embeddings and other metadata stored for a duplicate document removed automatically?

For me, it's not happening. If I ingest 2 documents using the Ingestion Pipeline, the docstore will have 2 documents, and if I re-ingest the same documents, the docstore will still have only 2 documents. The vector store, however, works in append-only mode, and the number of documents (based on nodes) in the index store keeps increasing.

Any help/guidance is much appreciated.

Reference link - https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline/
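
[Editor's note] For context on why re-ingestion might keep appending: per the linked docs, the pipeline's deduplication keys off each document's doc_id plus a content hash, so duplicates are only detected if the same document gets the same doc_id on every run. A minimal sketch of loading documents with stable IDs, assuming files live under a hypothetical ./data directory:

from llama_index.core import SimpleDirectoryReader

# filename_as_id=True derives each doc_id from the file path, so the
# same file re-loaded on a later run keeps the same doc_id and can be
# recognized as a duplicate instead of a brand-new document.
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()

If documents are instead constructed with fresh random IDs on every run, each run looks like entirely new content to the pipeline and the vector store keeps appending.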
6 comments
What vector store are you using? It works fine for me πŸ€·β€β™‚οΈ
Hi @Logan M, I'm using MongoDBAtlasVectorSearch from llama_index.vector_stores.mongodb as the Vector Store.
Here is the snippet of code I'm using for the setup of the Ingestion Pipeline.

Imports

import pymongo

from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor, SummaryExtractor
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

Setup of Ingestion Pipeline

embed_model = OpenAIEmbedding(model=ingestion_pipeline_cfg.embed_model, num_workers=4)
llm = OpenAI(model=ingestion_pipeline_cfg.llm_model, temperature=0.1)

vector_store = MongoDBAtlasVectorSearch(
    pymongo.MongoClient(mongo_uri),
    db_name=ingestion_pipeline_cfg.db_name,
    collection_name=ingestion_pipeline_cfg.collection_name,
    index_name=ingestion_pipeline_cfg.index_name,
)
docstore = MongoDocumentStore.from_uri(uri=mongo_uri, db_name=ingestion_pipeline_cfg.db_name)

# Split on semantic boundaries rather than fixed chunk sizes
parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=90,
    embed_model=embed_model,
)

# Extractors enrich each node's metadata before embedding
title_metadata_extractor = TitleExtractor(llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=4)
summary_extractor = SummaryExtractor(llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=4)
qa_extractor = QuestionsAnsweredExtractor(llm=llm, questions=3, num_workers=4)

pipeline = IngestionPipeline(
    transformations=[
        parser,
        title_metadata_extractor,
        summary_extractor,
        qa_extractor,
        embed_model,
    ],
    vector_store=vector_store,
    docstore=docstore,
)
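
[Editor's note] For reference, here is a hedged sketch of how this pipeline might be run with the document strategy made explicit; DocstoreStrategy.UPSERTS is the strategy discussed in this thread, and documents is assumed to be a list loaded with stable doc_ids as in the earlier sketch:

from llama_index.core.ingestion import DocstoreStrategy

pipeline = IngestionPipeline(
    transformations=[
        parser,
        title_metadata_extractor,
        summary_extractor,
        qa_extractor,
        embed_model,
    ],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,  # dedupe on doc_id + content hash
)

nodes = pipeline.run(documents=documents)
print(f"{len(nodes)} nodes produced for new or changed documents")

With this strategy, unchanged documents should be skipped and changed documents should have their old entries deleted before re-insertion, provided the attached vector store supports deletion by ref_doc_id.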
Additionally, when I'm using the UPSERTS doc strategy and try to re-ingest documents that already exist in the docstore, I'm getting the below error during the parsing step:
documents must be a non-empty list
with the following output:
Parsing nodes: 0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]

It seems that because there are no new documents for the Ingestion Pipeline to process, it throws the above error.
Curious to know how this issue can be bypassed or rectified?
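
[Editor's note] One possible workaround, as a sketch rather than a confirmed fix: since the error appears to be raised downstream once every input document has been filtered out as an unchanged duplicate, the run can be guarded so that an empty batch is treated as "nothing to do":

try:
    nodes = pipeline.run(documents=documents)
    print(f"Ingested {len(nodes)} new or changed nodes")
except (TypeError, ValueError) as e:
    # Assumption: "documents must be a non-empty list" bubbles up from the
    # MongoDB layer when the deduplicated batch is empty; treat that case
    # as a no-op instead of a failure.
    if "must be a non-empty list" in str(e):
        print("No new or changed documents to ingest; skipping.")
    else:
        raise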
@Logan M Let me know if you have any inputs on this.
@WhiteFang_Jr would appreciate your thoughts on this as well. TIA