
Updated 3 months ago

Duplicates

How does PGVector store avoid duplicates? Does it need the docstore to do so?

I am setting doc = Document(_id="some calculated unique id hash"), and I am not sure whether the database will raise an integrity error when the _id is not unique, since each generated node gets a random node_id...
5 comments
There is no duplicate checking by default; that would be up to you.

Each document is chunked into nodes, which will have different ids.
You can manually enable the docstore and use refresh to help with this, but it's a tad complicated right now.
It would be cool if we could pass a node_id function into the node_parser so that we can customise this behaviour.

I was generating a node_id from a document in my implementation before by doing the following.

node_id = f'{doc_id}__{i}'

Where i was the index of the node from the split.

The function signature could just be: node_id(idx: int, doc: Document)
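To make the idea concrete, here is a rough sketch of what I mean. Doc and split_with_ids are hypothetical stand-ins I made up for illustration (not the library's Document class or node parser), and the fixed-size character split is only a placeholder for a real splitter. The point is just that the id depends only on the doc id and the chunk index, so re-ingesting the same document yields the same node ids.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a Document; not the library's class."""
    doc_id: str
    text: str


def node_id(idx: int, doc: Doc) -> str:
    # Deterministic id: same document and same chunk index give the same id.
    return f"{doc.doc_id}__{idx}"


def split_with_ids(doc: Doc, chunk_size: int = 20) -> List[Tuple[str, str]]:
    # Placeholder splitter: fixed-size character chunks (assumption, for illustration only).
    chunks = [doc.text[i:i + chunk_size] for i in range(0, len(doc.text), chunk_size)]
    return [(node_id(i, doc), chunk) for i, chunk in enumerate(chunks)]
```

Note this only stays stable as long as the splitter settings do not change; a different chunk size would shift which text ends up under which index.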

Then by adding a unique constraint on the postgres schema this can be enforced at the data layer.
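A small sketch of the data-layer part, with some assumptions: I am using Python's built-in sqlite3 here only so it runs without a server, and the nodes table and insert_node helper are made up for illustration (this is not the schema PGVector store creates). In Postgres the same idea would be a UNIQUE constraint on the node id column, and you could skip duplicates with INSERT ... ON CONFLICT DO NOTHING instead of catching the error.

```python
import sqlite3

# In-memory database as a stand-in for Postgres (assumption, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id TEXT NOT NULL UNIQUE, text TEXT)")


def insert_node(conn: sqlite3.Connection, node_id: str, text: str) -> bool:
    """Insert a node; return False if the node_id already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO nodes (node_id, text) VALUES (?, ?)",
                (node_id, text),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected a duplicate node_id.
        return False
```

With deterministic ids like doc_id__0, inserting the same chunk twice leaves a single row, and the second call reports the duplicate instead of silently storing it.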
Happy to help with a PR for this if it's something that would get merged.