Updated last year

At a glance
The community member is wondering why there isn't a hierarchical sentence node parser, as they have seen good results with sentence window splitting and hierarchical splitting with auto-merging retriever. They suggest combining these approaches, but are unsure if an auto-merger would make sense for sentence-based splitting. The comments indicate that the hierarchical parser does use a sentence splitter by default, but combining the sentence window algorithm with hierarchical+auto merging may not be possible. Another community member suggests that the sentence splitter in llama-index is more advanced than just splitting by token count, and respects sentence boundaries while trying to remain within token limits for chunk size. The community members discuss potential implementations, such as splitting the document into sentences or chapters, and using auto-merging to reduce context size, but there is no explicitly marked answer.
Am I missing something, or why isn't there a hierarchical sentence node parser? I am seeing many good results with sentence window splitting and also with hierarchical splitting and the auto-merging retriever. Why not combine those?

Is there a reason an auto-merger would not make sense for sentence-based splitting?
5 comments
The hierarchical parser does use a sentence splitter under the hood by default.

But combining the sentence window algorithm with hierarchical+auto merging seems... not possible? I'm not even sure what the algorithm would look like in that case
Thanks Logan M! Yeah, the sentence splitter is quite naive as it splits by token count. Or that has been my understanding. From a natural-language viewpoint, I would assume the semantics would be conserved better if the split were done by splitting into sentences, as done by SentenceWindowNodeParser.

I was thinking of one of the following:
1) You split a document into sentences (possibly adding a sentence window). The leaves in the hierarchy would be the individual sentences, and the branches would be some set of sentences (for example, 10).
Or
2) Split the document into chapters (if available) and use them as branches, then split the chapters into sentences and use those as the leaves.
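The first strategy could be sketched in plain Python. This is a toy illustration, not llama-index API; the sentence regex and the branch size of 10 are assumptions for the sketch:

```python
import re

def split_sentences(text):
    # Naive sentence-boundary detection; real parsers use smarter tokenizers.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_hierarchy(text, branch_size=10):
    """Leaves are individual sentences; each branch groups branch_size of them."""
    leaves = split_sentences(text)
    return [
        leaves[i:i + branch_size]
        for i in range(0, len(leaves), branch_size)
    ]

branches = build_hierarchy("One. Two! Three? Four.", branch_size=2)
# Each inner list is a branch node whose children are the leaf sentences.
```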

When used with the auto-merging retriever, this could make the context shorter. For example: if each sentence in a chapter was in the top k and the sentence window was set to 3, the whole chapter would be in the context three times, just split into separate source nodes. With auto-merging, the chapter would appear in the context once, as a whole.
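The merging step described here could be approximated like this. It is a minimal sketch over plain node IDs, assuming a parent-to-children map; it is not the actual AutoMergingRetriever implementation:

```python
def auto_merge(retrieved_ids, children_of):
    """If every child of a parent was retrieved, replace the children with
    the single parent node, so the chapter appears once in the context."""
    merged = set(retrieved_ids)
    for parent, children in children_of.items():
        if children and all(c in merged for c in children):
            merged -= set(children)
            merged.add(parent)
    return merged

# Chapter "ch1" has three sentence leaves; all three were retrieved,
# so they collapse into the single chapter node.
result = auto_merge({"s1", "s2", "s3"}, {"ch1": ["s1", "s2", "s3"]})
```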

I have not checked how this would fit with the current implementations yet. I'm more looking for opinions on whether this makes any sense at all. So I hope you catch what I am after here!
sentence splitter is quite naive as it splits by token count -- this isn't true actually. The SentenceSplitter in llama-index splits by respecting sentence boundaries while also trying to remain within token limits for chunk size
I think the implementation you describe makes some sense. You could definitely implement this
Great to hear that the SentenceSplitter does that. I will take a look into this and see if I have the time to try implementing the above strategies, or if it even makes sense, considering the SentenceSplitter already has such advanced features. Thanks for your help!