Can someone explain how a Tree Index is

At a glance

The post asks how a Tree Index is built, what makes a node a parent node, and how the index tree looks like. The comments provide some insights:

A community member suggests that the Tree Index is a tree of summaries, where chunks are summarized in pairs, and the summaries are further summarized recursively. Another community member is exploring the DocumentSummaryIndex and trying to understand how it differs from the Tree Index.

A community member acknowledges that the Tree Index is mostly technical debt and needs a major refactor, as it is difficult to work on and understand. Another community member provides a detailed explanation of how the Tree Index is built, with the bottom level containing summaries of chunks, and the upper levels recursively summarizing the summaries.

The comments also suggest that the Tree Index could be useful for querying large documents like books, but the number of children per branch is an arbitrary choice. Additionally, the summaries should be augmented with metadata like keywords, characters, places, and topics.

There is no explicitly marked answer in the post or comments.

ttilleul

Can someone explain how a Tree Index is actually built ? Given a list of nodes extracted from a text file (or another document) ? What makes a node a parent node ? Querying the index is somewhat explained in the docs but not what the index tree looks like ...

5 comments

LLogan M

I'm 99% sure it's just a tree of summaries no?

Takes you chunks, summarize in pairs, summarize those summaries, and so on

ttilleul

I really don't know ... right now I'm looking at something completely different: DocumentSummaryIndex and trying to figure out how it internally works so I can estimate when to use it ... sure there are examples and they work but llama_index is so huge that I need to better grasp the differences of every concept in there ...

For instance I made a PR to better detail the responses modes in the doc because so far it can be quite obscure ... "tree_summarize" is not exactly how I would have called it for instance as no summarize template is really used here and it's not requested to "summarize" anything in the query... it's more a "compact-and-refine-then-recursively-compact-and-refine-answers-until-only-one-remains" response mode 🙂

From what I have understood about DocumentSummaryIndex is that it creates summaries of documents (using a template summary requesting specifically a summary and a response synthesizer of your choice) ... the index is mostly a list of documents + their summaries. The response synthesizer (and the chosen response mode) being responsible for how the summaries are created.

Then, when queried with the default query engine (using as_query_engine), there's a special template that proposes to the LLM a list of documents (+their summaries, as stored in the index) to chose from to answer the question ... once the LLM has answered which documents he wants, the chunks of these documents are parsed using the "refine" response mode, refining the answer with every LLM call ...

So now I need to understand how TreeIndex is different ... back to reading the code I suppose ...

LLogan M

😅 Tbh the tree index is mostly tech debt right now. We've been debating whether or not to deprecate it. Otherwise it needs a big refactor, because it's pretty tough to work on or understand (as you can see lol). Tbh that's one part of the codebase i've spent very little time in haha

ttilleul

ok here's what I understood (might be interesting for others), correct me if I'm wrong ...

The tree is defined by the leaves each branch can have: each branch of the tree has num_children
First the chunks are summarized by groups of num_children
The bottom - 1 level is made of number_of_chunks / num_children (rounded up) branches containing a summary and each branch has num_children leaves/children_chunks
Each of the branches (summaries) in bottom-1 level are now considered as "chunks" and recursion occurs: branches are grouped by num_children and we go back to step 2 and the whole process repeat until there's nothing to summarize anymore ...

with num_children = 3, here's an example

Plain Text

-root summary
  +-- super-summary 1
  |   +--summary 1
  |   |   +--chunk 1
  |   |   +--chunk 2
  |   |   \--chunk 3  
  |   +--summary 2
  |   |   +--chunk 4
  |   |   +--chunk 5
  |   |   \--chunk 6
  |   +--summary 3
  |       +--chunk 7
  |       +--chunk 8
  |       \--chunk 9
  +-- super-summary 2
      +--summary 4
      |   +--chunk 10
      |   +--chunk 11
      |   \--chunk 12  
      +--summary 5
      |   +--chunk 13
      |   +--chunk 14
      |   \--chunk 15
      +--summary 6
          +--chunk 16
          \--chunk 17

ttilleul

It's interesting as this could kind of help query something as big as a book although num_children is a bit arbitrary to begin with ... also each summary should be augmented with metadata like keywords but also more specific like characters (if the book is a novel) or places or topics ...

Add a reply

Find answers from the community

Can someone explain how a Tree Index is