Ingesting Duplicate Document Chunks with AzStorageBlobReader

Hi, I am using AzStorageBlobReader, and when I reinitialize my RAG pipeline it ingests duplicate document chunks. I think it's because the reader puts the files in a temporary directory: the doc_hash is not changing, whereas the doc_id seems to change. Any suggestions?
10 comments
Can you share your code?
If the index has already been created (locally or in some vector store), check whether it exists and, if it does, build the index directly from the vector store client; that will remove the data duplication you are facing.
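For example, a minimal sketch of that idea, assuming recent llama_index import paths and a hypothetical store_already_populated() check that you would implement against your own vector store client:
Plain Text
from llama_index.core import StorageContext, VectorStoreIndex


def load_or_build_index(vector_store, documents, embed_model):
    # If the store already holds vectors, build the index view directly on
    # top of it and skip ingestion entirely.
    if store_already_populated(vector_store):  # hypothetical existence check
        return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
    # Otherwise ingest once, persisting the vectors into the store.
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    return VectorStoreIndex.from_documents(
        documents, storage_context=storage_context, embed_model=embed_model
    )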
Pipeline setup
Plain Text
def setup_pipeline(self):
    """Set up the entire pipeline."""
    logger.info(f"Setting up pipeline for {self.tenant}/{self.project}...")
    try:
        # Set up necessary components
        if self.application_config.use_azure_blob:
            documents = self.get_blobs_from_container()
            logger.info(f"{documents[0]}")
        # ...

Function to get blobs from the container
Plain Text
def get_blobs_from_container(self):
    if self.tenant not in self.azure_blob_manager.list_containers():
        logger.info(f"Creating container for tenant {self.tenant}")
        self.azure_blob_manager.create_container(self.tenant)
        logger.info(f"Uploading data for tenant {self.tenant}")
        self.azure_blob_manager.upload_directory(
            self.tenant, self.tenant, self.project, self.application_config.data_path
        )
    logger.info(
        f"Loading documents from existing Azure Blob Storage for tenant "
        f"{self.tenant} ---> {self.application_config.data_path}"
    )
    reader = AzStorageBlobReader(
        container_name=self.tenant,
        connection_string="",  # redacted in the post
    )
    documents = reader.load_data()
    return documents


Ingesting nodes to vector store

Plain Text
ingestion = IngestionPipeline(
    transformations=[insert_metadata, parser, embed_model],
    vector_store=vector_store,
    docstore=docstore,
    docstore_strategy=DocstoreStrategy.UPSERTS,
)

nodes = ingestion.run(
    documents=documents, tenant=self.tenant, project=self.project
)
Why Temporary Directories Cause Duplicates

New File Paths, New doc_ids
When the AzStorageBlobReader fetches your blobs, it saves the files into a randomly generated temp folder (e.g., /tmp/tmpabc123/your_file.pdf).
Because LlamaIndex sees each file path as unique, it defaults to generating a new doc_id each time—even though the actual file content is the same.

Upsert Logic Relies on doc_id
The upsert process in DocstoreStrategy.UPSERTS primarily checks a document’s doc_id to decide whether to create or update.
If a new doc_id is generated on every run, the ingestion pipeline sees no matching IDs in the store, so it adds “new” nodes (duplicates in content).
Have I understood the flow correctly @Logan M ?
I think so. You'd have to process your documents to have consistent document ids
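One quick way to confirm that in your setup is to load the same container twice and compare what LlamaIndex assigns (a diagnostic sketch only):
Plain Text
# Load the same container twice. If the explanation above is right, the
# content hashes match but the doc_ids differ on every load, because the
# ids are derived from the random temp download path.
docs_a = reader.load_data()
docs_b = reader.load_data()
for a, b in zip(docs_a, docs_b):
    print(f"same hash: {a.hash == b.hash}, same doc_id: {a.doc_id == b.doc_id}")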
Got it, so creating a new doc_id from the original blob file name and assigning it before ingesting the nodes should solve the problem, I guess.

Do you think I should create a PR for this for readers dealing with remote blob storage, since the current code does not handle it?
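Something along these lines, for instance (a minimal sketch: the "file_name" metadata key is an assumption, with a content hash as a fallback in case the reader attaches no metadata at all):
Plain Text
import hashlib


def assign_stable_doc_ids(documents):
    """Give each Document a deterministic id so DocstoreStrategy.UPSERTS can
    match it across pipeline runs, regardless of the temp path it was
    downloaded to."""
    for doc in documents:
        blob_name = doc.metadata.get("file_name")  # assumed metadata key
        if blob_name:
            doc.id_ = f"azblob-{blob_name}"
        else:
            # Fallback: derive the id from the content itself.
            doc.id_ = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
    return documents


documents = assign_stable_doc_ids(reader.load_data())
# then pass these documents to ingestion.run(...) as before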
Also, an issue with _load_documents_with_metadata in LlamaIndex's Azure Storage Blob Reader

I noticed that the _load_documents_with_metadata method in base.py returns Document objects without metadata.

Concerns:
Standard Output Consistency – Shouldn’t all data loaders return documents with metadata to maintain consistency across the abstraction?
Data Integrity – Since this class is responsible for handling document ingestion, shouldn’t it preserve metadata instead of omitting it?
I also noticed that the get_metadata(file_name: str) function enforces an empty metadata dictionary. Was there a specific reason for this? If so, how should we work around this issue in our use case?
Probably just got missed. Feel free to make a PR
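In the meantime, one possible workaround (a sketch, not the reader's own API: download the blobs with the azure-storage-blob SDK and let SimpleDirectoryReader attach the file name as metadata via its file_metadata callback):
Plain Text
import tempfile
from pathlib import Path

from azure.storage.blob import ContainerClient
from llama_index.core import SimpleDirectoryReader


def load_blobs_with_metadata(connection_string: str, container_name: str):
    # Download every blob into a scratch directory ourselves.
    container = ContainerClient.from_connection_string(
        connection_string, container_name=container_name
    )
    tmp_dir = Path(tempfile.mkdtemp())
    for blob in container.list_blobs():
        local_path = tmp_dir / blob.name.replace("/", "__")
        local_path.write_bytes(container.download_blob(blob.name).readall())

    # Load with SimpleDirectoryReader, attaching the downloaded file name as
    # metadata so downstream steps (e.g. the stable doc_id assignment sketched
    # above) have something consistent to key on.
    reader = SimpleDirectoryReader(
        input_dir=str(tmp_dir),
        file_metadata=lambda path: {"blob_name": Path(path).name},
    )
    return reader.load_data()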