Regarding retrieving similarity data per

At a glance

The community member wants to retrieve similarity results per document rather than per node. Their documents contain too many tokens to map one-to-one onto nodes, so index.as_retriever(similarity_top_k=5) returns the top 5 nodes, not the top 5 documents. They ask whether there is an option to retrieve per document instead.

In the comments, another community member suggests that each node has a ref_doc_id that points to the ID of the parent document. They recommend that if the original parent documents are stored somewhere, the node can be replaced with its parent document after retrieving.

The original poster concludes that the best way to get a duplicate-free document list is to retrieve 5 chunks and then use ref_doc_id to remove duplicates, and the other community member tentatively agrees.

ＳＵＺＵＫＩ

Regarding retrieving similarity data per document, not per node
I know that "index.as_retriever(similarity_top_k=5)" can retrieve data for each node, but I would like to retrieve data for each document.
Due to the large number of tokens in a document, it is not possible to have a one-to-one relationship between a document and a node.
Is there any option or functionality that would allow us to retrieve per document?

3 comments

Logan M

Each node has a ref_doc_id that points to the ID of its parent document.

If you stored the original parent documents somewhere (like our docstore, or wherever you want), you can replace each node with its parent doc after retrieving.
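A rough sketch of that replacement step, for illustration only. It assumes each retrieved item is shaped like a LlamaIndex NodeWithScore (item.node.ref_doc_id and item.score), and that the parent documents were saved in a plain dict keyed by document ID. The helper name to_parent_documents, the parent_docs dict, and the sample data are all hypothetical; with a real docstore the lookup would go through that store instead.

```python
from types import SimpleNamespace


def to_parent_documents(retrieved, parent_docs):
    """Swap each retrieved node for its stored parent document.

    retrieved:   items shaped like NodeWithScore (item.node.ref_doc_id, item.score)
    parent_docs: hypothetical dict mapping document ID -> original document
    """
    results = []
    for item in retrieved:
        parent = parent_docs.get(item.node.ref_doc_id)
        if parent is None:
            continue  # parent was never stored, so there is nothing to swap in
        results.append((parent, item.score))
    return results


# Stand-in data; in practice `retrieved` would come from retriever.retrieve(query).
retrieved = [
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-a"), score=0.91),
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-b"), score=0.84),
]
parent_docs = {
    "doc-a": "full text of document A",
    "doc-b": "full text of document B",
}
parents = to_parent_documents(retrieved, parent_docs)
```

Note that this keeps one entry per retrieved node, so the same parent can appear more than once when several of its chunks match.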

ＳＵＺＵＫＩ

Thank you.
So the best way to get a duplicate-free document list is to get 5 chunks and then use "ref_doc_id" to remove duplicates.

Logan M

I think so 🤔
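The approach agreed on in this thread (retrieve chunks, then remove duplicates by ref_doc_id) could look roughly like the sketch below. It assumes the retrieved items are shaped like LlamaIndex NodeWithScore objects (item.node.ref_doc_id and item.score); the helper name dedupe_by_document and the sample data are made up for illustration.

```python
from types import SimpleNamespace


def dedupe_by_document(retrieved, top_k=None):
    """Keep only the best-scoring chunk per parent document, best first."""

    def score_of(item):
        # Treat a missing score as 0.0 so items stay comparable.
        return item.score if item.score is not None else 0.0

    best = {}
    for item in retrieved:
        doc_id = item.node.ref_doc_id
        if doc_id not in best or score_of(item) > score_of(best[doc_id]):
            best[doc_id] = item

    ranked = sorted(best.values(), key=score_of, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Stand-in for the result of index.as_retriever(similarity_top_k=5).retrieve(query).
chunks = [
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-a"), score=0.91),
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-a"), score=0.88),
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-b"), score=0.84),
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-c"), score=0.80),
    SimpleNamespace(node=SimpleNamespace(ref_doc_id="doc-b"), score=0.79),
]
unique_docs = dedupe_by_document(chunks)
```

Five chunks can collapse to fewer than five documents, as in the sample data here. If exactly five documents are needed, one option is to retrieve more chunks with a larger similarity_top_k and pass top_k=5 to the helper.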
