Updating Documents in Qdrant: Handling Deduplication and Versioning

At a glance

The community members are discussing how to update a Google Doc (GD) that has already been inserted into a MongoDB docstore and Qdrant vector store, without creating duplicate documents. They suggest using the doc_id as a field in the metadata for the LlamaIndex document, but are unsure how this would work if the document is broken into multiple nodes. The community members propose two approaches:

1. Deleting all nodes with the same doc_id before inserting the new nodes, but this could lead to missing documents in the serving process if the ingestion job fails.

2. Removing the document only after inserting the new ones, and potentially tweaking the source doc_id (e.g., adding "v1/v2/v3"). Alternatively, they suggest maintaining consistent node_ids and re-adding them, which is more challenging with Qdrant as it requires UUIDs.

The community members also discuss the possibility of using a different vector store if Qdrant is not the best fit, and mention looking into the id_func feature, but are unsure of its applicability.

logiclord

Understanding upserting documents in Qdrant. Suppose I have a Google Doc (GD) that was already inserted into the MongoDB docstore and Qdrant vector store. Now suppose the Google Doc was edited and we want to update it in both stores. How do we make sure we do not end up creating duplicate documents?

I heard that we can use doc_id as a field in the metadata of the LlamaIndex document and that this helps dedupe, but how does that work if the document is, say, 1000 pages, i.e. it is broken into multiple nodes? How does doc_id translate into per-node identifiers so that we know which nodes to update in the MongoDB docstore and Qdrant vector store?

If we as users should set node_id directly, then any guidance on how to generate node_id would be super helpful.

8 comments

Logan M

You can do vector_store.delete(doc_id) to delete all nodes that had that doc id listed as a source

Logan M

then insert your new nodes

logiclord

The issue with the vector_store.delete(doc_id) approach is this case: we remove the document from the index, then the ingestion job fails and the upsert doesn't go through. In the meantime, our serving process is missing this document and works with sub-optimal results.

Logan M

Probably I would only remove it after inserting the new ones? You'd need to tweak the source doc_id each time (maybe just adding v1/v2/v3 to it?)
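A minimal sketch of that insert-first, delete-after ordering, assuming a versioned source id such as "gdoc-123::v2". `Node`, `InMemoryStore`, and `swap_version` are hypothetical stand-ins for illustration, not LlamaIndex or Qdrant APIs; with a real setup the `add` and `delete` calls would go to the vector store and docstore.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    ref_doc_id: str  # versioned source id, e.g. "gdoc-123::v2"
    text: str


class InMemoryStore:
    """Hypothetical stand-in for a vector store; not a real client."""

    def __init__(self):
        self.nodes = {}

    def add(self, nodes):
        for node in nodes:
            self.nodes[node.node_id] = node

    def delete(self, ref_doc_id):
        # Drop every node whose source is the given (versioned) doc id.
        self.nodes = {
            node_id: node
            for node_id, node in self.nodes.items()
            if node.ref_doc_id != ref_doc_id
        }


def swap_version(store, base_id, old_version, new_version, chunks):
    """Insert the new version first, then remove the old one."""
    new_ref = f"{base_id}::v{new_version}"
    new_nodes = [
        Node(node_id=f"{new_ref}::{i}", ref_doc_id=new_ref, text=text)
        for i, text in enumerate(chunks)
    ]
    store.add(new_nodes)  # if this raises, the old version is still being served
    store.delete(f"{base_id}::v{old_version}")  # only reached after a successful add
    return new_ref
```

Between the add and the delete both versions are queryable, so retrieval may briefly return old and new chunks together. That is the trade-off for never serving a missing document.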

The only other approach is to have consistent node_ids and re-add them. This is tougher with Qdrant because it needs UUIDs for node ids.

So something like


nodes = vector_store.get_nodes(node_ids=[...])
# do something to change them?
vector_store.add(nodes)
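One way to get consistent node ids that are still valid UUIDs is to derive them with `uuid.uuid5` from the source doc_id and the chunk position. The sketch below assumes chunk order is a good enough identity for a node and uses a plain dict as a stand-in for the store; `NAMESPACE`, `stable_node_id`, and `upsert_document` are illustrative names, not library functions.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-index")


def stable_node_id(doc_id: str, chunk_index: int) -> str:
    """Same doc_id and chunk position always give the same UUID string."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_id}:{chunk_index}"))


def upsert_document(store: dict, doc_id: str, old_chunk_count: int, chunks: list) -> int:
    """Overwrite nodes in place, then remove leftovers if the doc got shorter."""
    for i, text in enumerate(chunks):
        store[stable_node_id(doc_id, i)] = text  # same key, so no duplicate
    for i in range(len(chunks), old_chunk_count):
        store.pop(stable_node_id(doc_id, i), None)  # stale trailing chunks
    return len(chunks)  # caller records this as the new chunk count
```

The old chunk count has to be tracked somewhere (for example in the docstore) so that trailing nodes can be removed when an edited document produces fewer chunks than before.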

Logan M

these are the only approaches I can think of

Logan M

yay vector dbs

logiclord

We are not tied to Qdrant. We can move to some other store if that would be better. I was looking at id_func; is that something that would be helpful?

Logan M

I think every vector db will have the same limitation (minus the node ids being UUIDs, that's on Qdrant). You'll have to go with one of the above approaches I think, or brainstorm a new one.

Not sure what id_func is actually, I'd have to look that up.
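On the id_func question: as far as I can tell, LlamaIndex node parsers such as SentenceSplitter accept an `id_func` callable that receives the chunk index and the source document and returns the node id as a string. Treat that signature, and the `id_` attribute used below, as assumptions to check against the installed version. A sketch of a callable in that shape, exercised against a stand-in object instead of a real Document:

```python
import uuid
from types import SimpleNamespace


def stable_id_func(i: int, doc) -> str:
    """Derive a UUID node id from the source document id and chunk index."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc.id_}:{i}"))


# Stand-in for a source document; only the id_ attribute is assumed here.
doc = SimpleNamespace(id_="gdoc-123")
ids = [stable_id_func(i, doc) for i in range(3)]

# Assumed usage, not verified here:
#   SentenceSplitter(id_func=stable_id_func)
```

If the parser does call it this way, re-ingesting an edited document yields the same node ids per chunk position, so re-adding overwrites existing nodes and the ids stay in the UUID format Qdrant expects.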

