Ingesting Duplicate Document Chunks with AzStorageBlobReader

Hi, I am using AzStorageBlobReader, and when I reinitialize my RAG pipeline it ingests duplicate document chunks. I think it's because the reader puts the files in a temporary directory: the doc_hash is not changing, whereas the doc_id seems to change. Any suggestions?
10 comments
Can you share your code?
If the index has already been created (locally or in some vector store), check whether it exists and, if it does, build the index directly from the vector store client; that will remove the data duplication you are facing.
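For example, a minimal sketch of that idea, assuming recent llama_index import paths and a hypothetical store_already_populated() check that you would implement against your own vector store client:
Plain Text
from llama_index.core import StorageContext, VectorStoreIndex


def load_or_build_index(vector_store, documents, embed_model):
    # If the store already holds vectors, build the index view directly on
    # top of it and skip ingestion entirely.
    if store_already_populated(vector_store):  # hypothetical existence check
        return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
    # Otherwise ingest once, persisting the vectors into the store.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    return VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, embed_model=embed_model
    )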
Pipeline setup
Plain Text
def setup_pipeline(self):
    """Set up the entire pipeline."""
    logger.info(f"Setting up pipeline for {self.tenant}/{self.project}...")
    try:
        # Set up necessary components
        if self.application_config.use_azure_blob:
            documents = self.get_blobs_from_container()
            logger.info(f"{documents[0]}")
        # ...

Function to get blobs from the container
Plain Text
def get_blobs_from_container(self):
    if self.tenant not in self.azure_blob_manager.list_containers():
        logger.info(f"Creating container for tenant {self.tenant}")
        self.azure_blob_manager.create_container(self.tenant)
        logger.info(f"Uploading data for tenant {self.tenant}")
        self.azure_blob_manager.upload_directory(
            self.tenant, self.tenant, self.project, self.application_config.data_path
        )
    logger.info(
        f"Loading documents from existing Azure Blob Storage for tenant "
        f"{self.tenant} ---> {self.application_config.data_path}"
    )
    reader = AzStorageBlobReader(
        container_name=self.tenant,
        connection_string="",  # redacted in the post
    )
    documents = reader.load_data()
    return documents


Ingesting nodes to vector store

Plain Text
ingestion = IngestionPipeline(
    transformations=[insert_metadata, parser, embed_model],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

nodes = ingestion.run(
    documents=documents, tenant=self.tenant, project=self.project
)
Why Temporary Directories Cause Duplicates

New File Paths, New doc_ids
When the AzStorageBlobReader fetches your blobs, it saves the files into a randomly generated temp folder (e.g., /tmp/tmpabc123/your_file.pdf).
Because LlamaIndex sees each file path as unique, it defaults to generating a new doc_id each time—even though the actual file content is the same.

Upsert Logic Relies on doc_id
The upsert process in DocstoreStrategy.UPSERTS primarily checks a document’s doc_id to decide whether to create or update.
If a new doc_id is generated on every run, the ingestion pipeline sees no matching IDs in the store, so it adds “new” nodes (duplicates in content).
Have I understood the flow correctly @Logan M ?
I think so. You'd have to process your documents to have consistent document ids
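One quick way to confirm that in your setup is to load the same container twice and compare what LlamaIndex assigns (a diagnostic sketch only):
Plain Text
# Load the same container twice. If the explanation above is right, the
# content hashes match but the doc_ids differ on every load, because the
# ids are derived from the random temp download path.
docs_a = reader.load_data()
docs_b = reader.load_data()
for a, b in zip(docs_a, docs_b):
    print(f"same hash: {a.hash == b.hash}, same doc_id: {a.doc_id == b.doc_id}")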
Got it, so creating a new doc_id from the original blob file name and assigning it before ingesting the nodes should solve the problem, I guess.

Do you think I should create a PR for this for readers dealing with remote blob storage, since the current code does not handle it?
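Something along these lines, for instance (a minimal sketch: the "file_name" metadata key is an assumption, with a content hash as a fallback in case the reader attaches no metadata at all):
Plain Text
import hashlib


def assign_stable_doc_ids(documents):
    """Give each Document a deterministic id so DocstoreStrategy.UPSERTS can
    match it across pipeline runs, regardless of the temp path it was
    downloaded to."""
    for doc in documents:
        blob_name = doc.metadata.get("file_name")  # assumed metadata key
        if blob_name:
            doc.id_ = f"azblob-{blob_name}"
        else:
            # Fallback: derive the id from the content itself.
            doc.id_ = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
    return documents


documents = assign_stable_doc_ids(reader.load_data())
# then pass these documents to ingestion.run(...) as before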
Also, an issue with _load_documents_with_metadata in LlamaIndex's Azure Storage Blob Reader

I noticed that the _load_documents_with_metadata method in base.py returns Document objects without metadata.

Concerns:
Standard Output Consistency – Shouldn’t all data loaders return documents with metadata to maintain consistency across the abstraction?
Data Integrity – Since this class is responsible for handling document ingestion, shouldn’t it preserve metadata instead of omitting it?
I also noticed that the get_metadata(file_name: str) function enforces an empty metadata dictionary. Was there a specific reason for this? If so, how should we work around this issue in our use case?
Probably just got missed. Feel free to make a PR
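In the meantime, one possible workaround (a sketch, not the reader's own API: download the blobs with the azure-storage-blob SDK and let SimpleDirectoryReader attach the file name as metadata via its file_metadata callback):
Plain Text
import tempfile
from pathlib import Path

from azure.storage.blob import ContainerClient
from llama_index.core import SimpleDirectoryReader


def load_blobs_with_metadata(connection_string: str, container_name: str):
    # Download every blob into a scratch directory ourselves.
    container = ContainerClient.from_connection_string(
        connection_string, container_name=container_name
    )
    tmp_dir = Path(tempfile.mkdtemp())
    for blob in container.list_blobs():
        local_path = tmp_dir / blob.name.replace("/", "__")
        local_path.write_bytes(container.download_blob(blob.name).readall())

    # Load with SimpleDirectoryReader, attaching the downloaded file name as
    # metadata so downstream steps (e.g. the stable doc_id assignment sketched
    # above) have something consistent to key on.
    reader = SimpleDirectoryReader(
        input_dir=str(tmp_dir),
        file_metadata=lambda path: {"blob_name": Path(path).name},
    )
    return reader.load_data()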