Duplicates

At a glance

The community members discuss how PGVector stores data and avoids duplicates. The main points are:

1. There is no duplicate checking by default, and it would be up to the user to handle this.

2. Each document is chunked into nodes, which will have different IDs.

3. One can manually enable the docstore and use refresh to help with avoiding duplicates, but it is complicated.

4. A community member suggests adding a custom node_id function to the node parser, which could generate unique IDs based on the document ID and node index. This could then be enforced with a unique constraint in the database schema.

The community member is open to submitting a pull request to implement this feature if it would be accepted.

How does the PGVector store avoid duplicates? Does it need the docstore to do so?

I am setting doc = Document(doc_id="some calculated unique id hash") and I am not sure whether the database will raise an integrity error if the id is not unique for some doc_id and generated node, since the node_id is a random id...
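For context, a stable doc_id can be derived by hashing the document's content, so re-ingesting the same text always yields the same id. A minimal sketch in Python, assuming a recent llama_index package layout (the hashing scheme is illustrative, not something the PGVector store prescribes):

import hashlib

from llama_index.core import Document

text = "..."  # raw document content goes here

# Same content -> same hash -> same doc_id across ingestion runs.
doc = Document(
    text=text,
    doc_id=hashlib.sha256(text.encode("utf-8")).hexdigest(),
)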
5 comments
There is no duplicate checking by default, that would be up to you

Each document is chunked into nodes, which will have different ids
You can manually enable the docstore and use refresh to help with this, but it's a tad complicated right now
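For reference, the docstore + refresh flow looks roughly like this. It is a sketch, not a definitive recipe: the imports and PGVectorStore.from_params parameters vary by llama_index version, and the connection details and table_name below are placeholders:

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Placeholder connection details -- replace with your own.
vector_store = PGVectorStore.from_params(
    database="mydb",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="my_table",
    embed_dim=1536,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = [Document(text="hello world", doc_id="doc-1")]

# store_nodes_override=True keeps nodes in the docstore even though
# PGVector already stores the node text, so refresh can track documents.
index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    store_nodes_override=True,
)

# Re-inserts only documents whose doc_id is new or whose content changed.
index.refresh_ref_docs(docs)

Note the deduplication only carries across runs if the docstore itself is persisted (e.g. via storage_context.persist()), which is part of why this is fiddly today.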
It would be cool if we could pass a node_id function into the node_parser so that we can customise this behaviour.

I was generating a node_id from a document in my implementation before by doing the following.

node_id = f'{doc_id}__{i}'

Where i was the index of the node from the split.

The function could just be: node_id(idx: int, doc: Document)
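Pending a hook like that in the parser itself, the same effect can be had today by renaming nodes after the split. A minimal sketch, with SentenceSplitter standing in for whatever node parser is in use:

from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

def node_id(idx: int, doc: Document) -> str:
    # Deterministic: same document + same chunk index -> same node_id.
    return f"{doc.doc_id}__{idx}"

doc = Document(text="some long text to be chunked", doc_id="doc-1")
nodes = SentenceSplitter().get_nodes_from_documents([doc])

for i, node in enumerate(nodes):
    node.id_ = node_id(i, doc)

The ids are only stable as long as the splitter settings and the document content stay the same, since the chunk index depends on both.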

Then, by adding a unique constraint to the Postgres schema, this can be enforced at the data layer.
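Concretely, that could be done with a one-off DDL statement. A sketch using psycopg2; the data_my_table and node_id names are assumptions based on the table layout the PGVector integration typically creates, so verify the actual names in your database first:

import psycopg2

# Assumed names: llama_index's PGVectorStore typically creates a table
# called data_<table_name> with a node_id column.
conn = psycopg2.connect(
    "dbname=mydb user=postgres password=password host=localhost"
)
with conn, conn.cursor() as cur:
    cur.execute(
        "ALTER TABLE data_my_table "
        "ADD CONSTRAINT uq_data_my_table_node_id UNIQUE (node_id)"
    )
conn.close()

With that constraint in place, inserting a node whose node_id already exists raises an IntegrityError instead of silently creating a duplicate row.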
Happy to help with a PR for this if it's something that would get merged once worked on.