The community member has a collection of long documents that they are trying to chunk and index using a SentenceSplitter. They want to ensure that the RAG retrieval returns results at a document level, rather than just the best matching nodes. The concern is that if they only get the top matching nodes, they may end up with 10 segments/nodes from the same 'very rude conversation', instead of getting a list of the best matching documents.
The community members discuss a few options, including writing a custom node-postprocessor to iterate over retrieved nodes and use the node.ref_doc_id or node metadata to find the parent document. However, this would still require the community member to filter the responses, rather than incorporating this concept into the retrieval and scoring process itself.
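The suggested postprocessor itself isn't shown in the thread; a minimal sketch of what it might look like, assuming `llama_index.core`-style imports (the class name is hypothetical):

```python
from typing import List, Optional

from llama_index.core import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore


class DocumentDedupPostprocessor(BaseNodePostprocessor):
    """Keep only the best-scoring node per parent document."""

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        best_per_doc: dict = {}
        for n in nodes:
            doc_id = n.node.ref_doc_id  # id of the parent document
            if doc_id not in best_per_doc or (n.score or 0.0) > (
                best_per_doc[doc_id].score or 0.0
            ):
                best_per_doc[doc_id] = n
        # return one node per document, best scores first
        return sorted(
            best_per_doc.values(), key=lambda n: n.score or 0.0, reverse=True
        )
```

It would plug into a query engine like `index.as_query_engine(similarity_top_k=50, node_postprocessors=[DocumentDedupPostprocessor()])`, over-fetching chunks so that enough distinct documents survive the deduplication.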
The community member is using OpenSearch as the backend and was hoping to do something similar to what is supported in Elastic, where the vector search response can return the top K documents directly instead of just the top K segments. The issue they are trying to overcome is if they retrieve the 'top 5' hits but all five hits come from the same document, then they end up only looking at 1 conversation.
As a workaround, the community member suggests increasing the number of results to a very large number and then iterating through them to deduplicate by parent document.
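A rough sketch of that over-fetch workaround, assuming `index` is an existing `VectorStoreIndex` over the chunked documents:

```python
# Pull far more chunks than needed, then reduce to the best-scoring
# chunk per parent document before picking the top documents.
retriever = index.as_retriever(similarity_top_k=100)  # deliberately large
nodes = retriever.retrieve("find me conversations with rude speakers")

doc_scores: dict = {}
for n in nodes:
    doc_id = n.node.ref_doc_id  # id of the parent document
    doc_scores[doc_id] = max(doc_scores.get(doc_id, 0.0), n.score or 0.0)

# top 5 *documents*, not top 5 chunks
top_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:5]
```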
I have a collection of long documents, which I am trying to chunk and index with a SentenceSplitter. How can I ensure that RAG retrieval returns results at the document level?
To be clear, what I mean is: I want to index chunks, but when I perform a search like "find me conversations with rude speakers", I want to get back a list of the best matching documents, not just the best matching nodes, since the latter may result in 10 segments/nodes all coming from the same 'very rude conversation'.
The chunking helps with search, but does llama-index provide any sort of built-in mechanism to facilitate this kind of retrieval filtering?
Ideally my goal is to support things like "find me the rudest conversations from last week" or something along those lines. I'm close but this document-level post-filter has got me stuck.
@Logan M thanks, but I guess this would still require me to filter the responses, as opposed to incorporating this concept into the retrieval and scoring process itself?
I'm using OpenSearch as the backend, and I was hoping to do something similar to what is supported in Elastic, where the vector search response can return the top K documents directly instead of just the top K segments.
The issue I"m trying to overcome is if I retrieve the 'top 5' hits but all five hits come from the same document, then I end up only looking at 1 conversation.