
Updated 2 weeks ago

Updating Documents in Qdrant: Handling Deduplication and Versioning

I'm trying to understand upserting documents in Qdrant. Suppose I have a Google Doc (GD) that was already inserted into the MongoDB docstore and the Qdrant vector store. Now suppose the Google Doc was edited and we want to update it in both the MongoDB docstore and the Qdrant vector store. How do we make sure that we do not end up creating duplicate documents?

I heard that we can use doc_id as a field in the metadata of a LlamaIndex document and that this helps with deduplication, but how does that work if the document is, say, 1000 pages, i.e. the document is broken into multiple nodes? How does doc_id translate into per-node identifiers, so that we know which nodes to update in the MongoDB docstore and the Qdrant vector store?

If we as users should set node_id directly, then any guidance on how to generate node_id would be super helpful.
8 comments
You can call vector_store.delete(doc_id) to delete all nodes that have that doc ID listed as their source,
then insert your new nodes.
The issue with the vector_store.delete(doc_id) approach is this case: we remove the document from the index, then the ingestion job fails and the upsert never happens. In the meantime, our serving process is missing this document and returns sub-optimal results.
I would probably only remove the old nodes after inserting the new ones. You'd need to tweak the source doc_id each time (maybe just appending v1/v2/v3 to it?)
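
A minimal sketch of that insert-then-delete ordering with a versioned source ID. The `InMemoryStore` class, the `replace_document` helper, and the `gdoc-123-vN` naming are all hypothetical stand-ins for illustration, not LlamaIndex or Qdrant APIs; a real vector store's add and delete calls would take their place.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a vector store: node_id -> (ref_doc_id, text)."""
    points: dict = field(default_factory=dict)

    def add(self, ref_doc_id: str, chunks: list) -> None:
        # One entry per chunk, keyed by source id plus chunk position.
        for i, text in enumerate(chunks):
            self.points[f"{ref_doc_id}:{i}"] = (ref_doc_id, text)

    def delete(self, ref_doc_id: str) -> None:
        # Drop every node whose source is the given ref_doc_id.
        self.points = {k: v for k, v in self.points.items() if v[0] != ref_doc_id}

    def ref_doc_ids(self) -> set:
        return {ref for ref, _ in self.points.values()}


def replace_document(store: InMemoryStore, base_id: str, old_version: int, new_chunks: list) -> int:
    """Insert version N+1 first; delete version N only after the insert succeeded."""
    new_ref = f"{base_id}-v{old_version + 1}"
    old_ref = f"{base_id}-v{old_version}"
    store.add(new_ref, new_chunks)  # if this raises, the old version stays searchable
    store.delete(old_ref)           # only reached after a successful insert
    return old_version + 1


store = InMemoryStore()
store.add("gdoc-123-v1", ["old page 1", "old page 2"])
new_version = replace_document(store, "gdoc-123", 1, ["new page 1", "new page 2", "new page 3"])
```

One trade-off with this ordering: between the insert and the delete, both versions are in the index, so queries in that window may return nodes from either one.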

The only other approach is to have consistent node_ids and re-add the nodes. This is tougher with Qdrant because it needs UUIDs for node IDs.

So something like

Plain Text
nodes = vector_store.get_nodes(node_ids=[...])
# do something to change them?
vector_store.add(nodes)
these are the only approaches I can think of
yay vector dbs
We are not tied to Qdrant. We can move to some other store if that would be better. I was looking at id_func; is that something that would help?
I think every vector DB will have the same limitation (minus the node IDs having to be UUIDs, that's on Qdrant). You'll have to go with one of the above approaches I think, or brainstorm a new one.

Not sure what id_func is actually, I'd have to look that up.
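
For the consistent node_id approach discussed above, here is a minimal sketch of one way to get IDs that are both stable across ingestion runs and valid UUIDs: derive them with `uuid.uuid5` from the source doc_id and the chunk position. The `stable_node_id` helper, the namespace URL, and the `gdoc-123` ID are hypothetical names chosen for illustration. Whether this can be plugged into a node parser's id_func depends on the callable signature your LlamaIndex version expects, so check that before relying on it.

```python
import uuid

# Hypothetical application namespace; any fixed UUID works,
# as long as it never changes between ingestion runs.
NODE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/my-ingestion-pipeline")


def stable_node_id(doc_id: str, chunk_index: int) -> str:
    """Derive a deterministic UUID string from a source doc_id and a chunk position."""
    return str(uuid.uuid5(NODE_NAMESPACE, f"{doc_id}:{chunk_index}"))


# Re-running ingestion for the same document yields the same IDs,
# so an upsert can overwrite the old points instead of duplicating them.
first_run = [stable_node_id("gdoc-123", i) for i in range(3)]
second_run = [stable_node_id("gdoc-123", i) for i in range(3)]
```

Position-based IDs only overwrite chunks that still exist after the edit. If the edited document produces fewer chunks than before, the trailing nodes from the old version keep their IDs and have to be deleted separately.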