Hi team,
Thanks for all the wonderful work you guys have been doing.

I was wondering if someone could help me with a query I have regarding the Ingestion Pipeline and Document Management in LlamaIndex.

I've seen that the docstore can deduplicate documents when they are ingested through the Ingestion Pipeline with a Vector Store configured, and I have experimented with this as well.

Does this apply to the Vector Store as well? That is, are the embeddings and other metadata stored for duplicate documents removed automatically?

For me this isn't happening, if it is even possible: if I ingest 2 documents using the Ingestion Pipeline, the docstore has 2 documents, and if I re-ingest the same documents, the docstore still has only 2 documents. The vector store, however, only works in append mode, and the number of documents (based on nodes) in the index store keeps increasing.
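To make this concrete, here's a stripped-down sketch of the behaviour I'd expect (simplified to an in-memory docstore with no Mongo and no vector store; the document texts and IDs are just placeholders):

from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore

docs = [
    Document(text="first test document", doc_id="doc-1"),
    Document(text="second test document", doc_id="doc-2"),
]

pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),
)

first_run = pipeline.run(documents=docs)   # nodes produced for both documents
second_run = pipeline.run(documents=docs)  # same documents again -> no new nodes expected
print(len(first_run), len(second_run))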

Any help/guidance is much appreciated.

Reference link - https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline/
6 comments
What vector store are you using? It works fine for me πŸ€·β€β™‚οΈ
Hi @Logan M , I'm using MongoDBAtlasVectorSearch from llama_index.vector_stores.mongodb as the Vector Store.
Here is the snippet of code I'm using to set up the Ingestion Pipeline.

Imports

import pymongo  # needed for the MongoClient passed to MongoDBAtlasVectorSearch

from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter, SemanticSplitterNodeParser
from llama_index.core.extractors import TitleExtractor, QuestionsAnsweredExtractor, SummaryExtractor
from llama_index.storage.docstore.mongodb import MongoDocumentStore
from llama_index.core.ingestion import IngestionPipeline
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

Setup of Ingestion Pipeline

embed_model = OpenAIEmbedding(model=ingestion_pipeline_cfg.embed_model, num_workers=4)
llm = OpenAI(model=ingestion_pipeline_cfg.llm_model, temperature=0.1)

# Mongo-backed vector store and docstore, pointing at the same database
vector_store = MongoDBAtlasVectorSearch(
    pymongo.MongoClient(mongo_uri),
    db_name=ingestion_pipeline_cfg.db_name,
    collection_name=ingestion_pipeline_cfg.collection_name,
    index_name=ingestion_pipeline_cfg.index_name,
)
docstore = MongoDocumentStore.from_uri(uri=mongo_uri, db_name=ingestion_pipeline_cfg.db_name)

# Semantic chunking plus metadata extractors
parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)

title_metadata_extractor = TitleExtractor(llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=4)
summary_extractor = SummaryExtractor(llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=4)
qa_extractor = QuestionsAnsweredExtractor(questions=3, num_workers=4)

pipeline = IngestionPipeline(
    transformations=[
        parser,
        title_metadata_extractor,
        summary_extractor,
        qa_extractor,
        embed_model,
    ],
    vector_store=vector_store,
    docstore=docstore,
)
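The pipeline is then run roughly like this (document loading omitted; load_my_documents is just a placeholder for however the raw files are read into llama_index Document objects):

documents = load_my_documents()  # placeholder, not a real function
nodes = pipeline.run(documents=documents, show_progress=True)
print(f"Ingested {len(nodes)} nodes")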
Additionally, when I'm using the docstore strategy UPSERTS and I try to re-ingest the same documents that already exist in the docstore, I get the error below during the parsing step -
documents must be a non-empty list
with the following print statements -
Parsing nodes: 0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]

It seems that since there are no new documents for the Ingestion Pipeline to process, it throws the above error.
Curious to know how this issue can be bypassed or rectified?
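For reference, the strategy is set roughly like this (everything else is the same as in the snippet above):

from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[parser, title_metadata_extractor, summary_extractor, qa_extractor, embed_model],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,  # reuses the objects defined above
)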
@Logan M Let me know if you have any inputs on this.
@WhiteFang_Jr would appreciate your thoughts on this as well. TIA