
Updated 3 months ago

[Question]: How to insert/delete documen...

Can someone please help me with this question? https://github.com/run-llama/llama_index/issues/14756

It has been an open GitHub issue for the last two months.
32 comments
You are using the ingestion pipeline, so as long as your input documents have the same document ids for the same document, it will upsert properly. The ingestion pipeline is already inserting into your docstore and vector store, no need to do this twice.

You can also manually manage your data.

index.delete(ref_doc_id) will delete using the original input document ids

You can also delete using the vector store and docstore directly

Plain Text
vector_store.delete(ref_doc_id)
docstore.delete_ref_doc(ref_doc_id)


You can also delete nodes (not every vector store implements this yet, qdrant does tho)

Plain Text
vector_store.delete_nodes(node_ids=[...])
for node_id in node_ids:
  docstore.delete_document(node_id)
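The two ref-doc deletion calls above can be wrapped in one small helper. This is only a sketch: `delete_everywhere` and the two `Fake...` classes are hypothetical stand-ins so the snippet runs without any services, and the only calls assumed from the real stores are `vector_store.delete(ref_doc_id)` and `docstore.delete_ref_doc(ref_doc_id)` as quoted in this reply.

```python
class FakeVectorStore:
    """Hypothetical in-memory stand-in; only mimics delete(ref_doc_id)."""

    def __init__(self, rows):
        # rows: node_id -> ref_doc_id
        self.rows = dict(rows)

    def delete(self, ref_doc_id):
        self.rows = {n: r for n, r in self.rows.items() if r != ref_doc_id}


class FakeDocstore:
    """Hypothetical in-memory stand-in; only mimics delete_ref_doc(ref_doc_id)."""

    def __init__(self, ref_docs):
        # ref_docs: ref_doc_id -> list of node ids
        self.ref_docs = {k: list(v) for k, v in ref_docs.items()}

    def delete_ref_doc(self, ref_doc_id):
        if ref_doc_id not in self.ref_docs:
            raise ValueError(f"ref_doc_id {ref_doc_id} not found")
        del self.ref_docs[ref_doc_id]


def delete_everywhere(ref_doc_ids, vector_store, docstore):
    """Delete each ref doc from both stores; return ids the docstore did not know."""
    missing = []
    for ref_doc_id in ref_doc_ids:
        vector_store.delete(ref_doc_id)
        try:
            docstore.delete_ref_doc(ref_doc_id)
        except ValueError:
            missing.append(ref_doc_id)
    return missing


vector_store = FakeVectorStore({"n1": "a.txt", "n2": "a.txt", "n3": "b.txt"})
docstore = FakeDocstore({"a.txt": ["n1", "n2"], "b.txt": ["n3"]})
missing = delete_everywhere(["a.txt", "zzz.txt"], vector_store, docstore)
```

A non-empty return value is a hint that the docstore you are deleting from never saw those IDs, which is worth checking before assuming the delete itself failed.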
Logan, this is good input. Just to understand how the doc_ids system works: if I load a document, say a .txt file, then modify it, then reload it, will the id automatically be the same, so that it gets upserted?
if you set the id to be the same (like setting it to be a file name), then yes
yes* (if you use an ingestion pipeline with docstore + vector store)
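To make the ID question concrete, here is a toy model of an upsert keyed on document ID. It is not LlamaIndex code: `ToyStore`, `upsert`, and the SHA-256 content hash are assumptions chosen for illustration, and the real pipeline may detect changes differently.

```python
import hashlib


class ToyStore:
    """Toy stand-in for a docstore + vector store pair (hypothetical)."""

    def __init__(self):
        self.hashes = {}  # doc_id -> content hash (docstore role)
        self.nodes = {}   # doc_id -> list of chunks (vector store role)

    def upsert(self, doc_id, text):
        new_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
        old_hash = self.hashes.get(doc_id)
        if old_hash == new_hash:
            # Same ID and same content: nothing to do.
            return "skipped"
        action = "inserted" if old_hash is None else "replaced"
        # Same ID but new content: delete the old chunks, then re-insert.
        self.nodes.pop(doc_id, None)
        self.nodes[doc_id] = text.split()
        self.hashes[doc_id] = new_hash
        return action


store = ToyStore()
first = store.upsert("notes.txt", "hello world")
second = store.upsert("notes.txt", "hello world")
third = store.upsert("notes.txt", "hello brave new world")
```

Because the file name is used as the ID, the modified file replaces the old entry instead of being stored next to it. With a random ID per load, the store would have no way to tell that the two loads are the same document.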
Thank you for sharing the details. I was wondering how I can get the ref doc id based on the filename. For instance, the user selects a file in the UI (so I have the filename); how can I use it to get the ref doc id? I am struggling to find a method in the ingestion pipeline that fetches the ref doc id for that filename so I can then run the operations.
Is there a way to get the association between a filename and its ref_doc_id?
When I try to access the ref_doc_info property on the index object (a VectorStoreIndex instance), I get the error NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.
If I initialize the vector store index with store_nodes_override=True, I get an empty dict {} when I access the ref_doc_info property.
The vector store index initialization looks like this:

Plain Text
index = VectorStoreIndex.from_vector_store(
    self.vector_store,
    Settings.embed_model,
    store_nodes_override=True,
)
I am using qdrant vector db in my application
Plain Text
IngestionPipeline(
    transformations=[self.parser],
    docstore=self.docstore,
    vector_store=self.qdrant_search.vector_store,
    cache=IngestionCache(
        cache=RedisCache.from_host_and_port(
            host=self.config.redis_config.host,
            port=self.config.redis_config.port,
        ),
        collection="redis_cache",
    ),
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
no need to set store nodes override, the ingestion pipeline is already doing all the work in your setup

You just need to make sure your input documents have consistent IDs

If you are using simple directory reader, you can do something like SimpleDirectoryReader(..., filename_as_id=True)
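If the UI only knows a bare filename, one option is to keep your own filename-to-ID lookup next to the pipeline. The sketch below is an assumption-heavy illustration: `FilenameRegistry` is a hypothetical helper, not a LlamaIndex class, and the exact ID string the reader produces is not shown in this thread, so record whatever ID your documents actually carry.

```python
from collections import defaultdict
from pathlib import PurePosixPath


class FilenameRegistry:
    """Hypothetical helper: maps a bare filename to the ref_doc_ids ingested from it."""

    def __init__(self):
        self._ids = defaultdict(set)

    def record(self, file_path, ref_doc_id):
        # Call this at ingestion time with the ID the document actually has.
        self._ids[PurePosixPath(file_path).name].add(ref_doc_id)

    def lookup(self, filename):
        return sorted(self._ids.get(filename, ()))

    def forget(self, filename):
        # Call this after the ids have been deleted from the stores.
        return sorted(self._ids.pop(filename, ()))


registry = FilenameRegistry()
registry.record("data/report.pdf", "data/report.pdf")
registry.record("data/notes.txt", "data/notes.txt")

found = registry.lookup("report.pdf")
unknown = registry.lookup("missing.pdf")
removed = registry.forget("report.pdf")
after = registry.lookup("report.pdf")
```

When the ID is derived from the file path, as in the example values here, the lookup is almost trivial. The registry only earns its keep when one file maps to several IDs or the ID scheme is not a plain path.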
Got it, thanks a lot!!
@Logan M I'm still facing one issue. The docstore is not updated after deletion. The vector store gets updated automatically once the nodes are deleted. Do I need to explicitly call a method to refresh the docstore, to ensure the references are removed from it after deletion?
Plain Text
# Delete from docstore
logger.info(f"Deleting document with ids from docstore: {ref_doc_ids_to_delete}")
docstore = self.ingestion.docstore
for ref_doc_id in ref_doc_ids_to_delete:
    try:
        # Log the collection names this docstore uses
        logger.info(
            "{} | {} | {}".format(
                docstore._ref_doc_collection,
                docstore._metadata_collection,
                docstore._node_collection,
            )
        )
        await docstore.adelete_ref_doc(ref_doc_id, raise_error=True)
    except Exception as e:
        logger.warning(f"Docstore deletion warning for {ref_doc_id}: {e}")
By deletion, what do you mean in this case?
The above code seems mostly correct?
What I intend to do is the following:

I have a ref_doc_ids_to_delete list. I want to delete those ref doc ids from both the docstore and the vector store, and the deletion should be reflected in the Qdrant collection and in the Redis docstore (its metadata and doc collections should be updated).
I have changed docstore_strategy=DocstoreStrategy.UPSERTS to docstore_strategy=DocstoreStrategy.UPSERTS_AND_DELETE.
The current issue is that after deleting the ref doc ids, the vector store shows the updated state when I check the collection items.

However, when I check the Redis collection (docstore), the old file references still exist.
Docstore Initialization

Plain Text
self.docstore = RedisDocumentStore.from_host_and_port(
    host=self.config.redis_config.host,
    port=self.config.redis_config.port,
    namespace="xxx",
)


Plain Text
self.vector_store = QdrantVectorStore(**vector_store_config)


Plain Text
self.index = VectorStoreIndex.from_vector_store(
    self.vector_store,
    Settings.embed_model,
    store_nodes_override=True,
)
I have tried everything, but nothing seems to work. Any help would be greatly appreciated.
I found one issue. The docstore attached to the ingestion pipeline object and the one attached to the Qdrant vector store index object are different:

<llama_index.storage.docstore.redis.base.RedisDocumentStore object at 0x7f989f7985e0>
<llama_index.core.storage.docstore.simple_docstore.SimpleDocumentStore object at 0x7f98a4636820>
The combination of QdrantVectorStore and RedisDocumentStore is not working together. After reviewing the code, it seems that VectorStoreIndex does not support key-value docstores, so the storage context is overridden by the default docstore (SimpleDocumentStore). Please correct me if I am wrong @Logan M
That's not correct. As long as you attach the vector store and docstore to the ingestion pipeline (and you save the docstore somewhere!), it should work fine.

It's pretty hard to debug without seeing some minimum version of your code flow
Hello! I am reading through this thread and the original GitHub issue. The GitHub issue seems to suggest that upserts are effectively just a delete and re-insert.
From this thread, it sounds like IngestionPipeline should handle upserts automatically if the input documents have consistent IDs.
Could anyone help me clarify which is true, and whether I'm doing something wrong?

I am setting a deterministic id_ on each Document in the list I pass to pipeline.run(documents=documents). As a result, the doc_id and ref_doc_id properties reflect that custom ID, but the actual id in Weaviate is different, so the document is duplicated in the vector store every time I run the ingestion.

Here is a sample:

storage_context = StorageContext.from_defaults(vector_store=vector_store)

pipeline = IngestionPipeline(
    transformations=transformers,
    vector_store=vector_store,
)
nodes = pipeline.run(documents=documents)

VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
    show_progress=True,
    embed_model=Settings.embed_model,
)
Yes, an upsert is a delete + reinsert

The ingestion pipeline does handle this, assuming you attached both a docstore and vector store to it
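The difference a docstore makes can be shown with a toy comparison. This is not the pipeline's implementation: both functions are hypothetical, and the sketch assumes every run assigns a fresh random row ID, which matches the symptom described in the question.

```python
import uuid


def ingest_without_docstore(rows, doc_id, text):
    """No record of earlier runs, so every run appends a new row."""
    rows.append({"id": str(uuid.uuid4()), "ref_doc_id": doc_id, "text": text})


def ingest_with_docstore(rows, seen_ids, doc_id, text):
    """Upsert: delete rows from earlier runs of this document, then re-insert."""
    if doc_id in seen_ids:
        rows[:] = [r for r in rows if r["ref_doc_id"] != doc_id]
    seen_ids.add(doc_id)
    rows.append({"id": str(uuid.uuid4()), "ref_doc_id": doc_id, "text": text})


# Three runs over the same document without any memory of earlier runs
plain_rows = []
for _ in range(3):
    ingest_without_docstore(plain_rows, "doc-1", "same text")

# Three runs with a set of seen IDs playing the docstore role
tracked_rows, seen = [], set()
for _ in range(3):
    ingest_with_docstore(tracked_rows, seen, "doc-1", "same text")
```

Without the record of seen IDs, the deterministic `doc_id` is stored on each row but nothing ever uses it to find and remove the earlier copies, so the rows pile up.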
I see, so if I do not use a doc store then I cannot expect it to work that way
I am not using a doc store... so I will take that as my answer. Thank you!