Find answers from the community

Updated last year

Can someone explain how a Tree Index is

Can someone explain how a Tree Index is actually built ? Given a list of nodes extracted from a text file (or another document) ? What makes a node a parent node ? Querying the index is somewhat explained in the docs but not what the index tree looks like ...
L
t
5 comments
I'm 99% sure it's just a tree of summaries no?

Takes you chunks, summarize in pairs, summarize those summaries, and so on
I really don't know ... right now I'm looking at something completely different: DocumentSummaryIndex and trying to figure out how it internally works so I can estimate when to use it ... sure there are examples and they work but llama_index is so huge that I need to better grasp the differences of every concept in there ...

For instance I made a PR to better detail the responses modes in the doc because so far it can be quite obscure ... "tree_summarize" is not exactly how I would have called it for instance as no summarize template is really used here and it's not requested to "summarize" anything in the query... it's more a "compact-and-refine-then-recursively-compact-and-refine-answers-until-only-one-remains" response mode πŸ™‚

From what I have understood about DocumentSummaryIndex is that it creates summaries of documents (using a template summary requesting specifically a summary and a response synthesizer of your choice) ... the index is mostly a list of documents + their summaries. The response synthesizer (and the chosen response mode) being responsible for how the summaries are created.

Then, when queried with the default query engine (using as_query_engine), there's a special template that proposes to the LLM a list of documents (+their summaries, as stored in the index) to chose from to answer the question ... once the LLM has answered which documents he wants, the chunks of these documents are parsed using the "refine" response mode, refining the answer with every LLM call ...

So now I need to understand how TreeIndex is different ... back to reading the code I suppose ...
πŸ˜… Tbh the tree index is mostly tech debt right now. We've been debating whether or not to deprecate it. Otherwise it needs a big refactor, because it's pretty tough to work on or understand (as you can see lol). Tbh that's one part of the codebase i've spent very little time in haha
ok here's what I understood (might be interesting for others), correct me if I'm wrong ...

  1. The tree is defined by the leaves each branch can have: each branch of the tree has num_children
  2. First the chunks are summarized by groups of num_children
  3. The bottom - 1 level is made of number_of_chunks / num_children (rounded up) branches containing a summary and each branch has num_children leaves/children_chunks
  4. Each of the branches (summaries) in bottom-1 level are now considered as "chunks" and recursion occurs: branches are grouped by num_children and we go back to step 2 and the whole process repeat until there's nothing to summarize anymore ...
with num_children = 3, here's an example
Plain Text
-root summary
  +-- super-summary 1
  |   +--summary 1
  |   |   +--chunk 1
  |   |   +--chunk 2
  |   |   \--chunk 3  
  |   +--summary 2
  |   |   +--chunk 4
  |   |   +--chunk 5
  |   |   \--chunk 6
  |   +--summary 3
  |       +--chunk 7
  |       +--chunk 8
  |       \--chunk 9
  +-- super-summary 2
      +--summary 4
      |   +--chunk 10
      |   +--chunk 11
      |   \--chunk 12  
      +--summary 5
      |   +--chunk 13
      |   +--chunk 14
      |   \--chunk 15
      +--summary 6
          +--chunk 16
          \--chunk 17
It's interesting as this could kind of help query something as big as a book although num_children is a bit arbitrary to begin with ... also each summary should be augmented with metadata like keywords but also more specific like characters (if the book is a novel) or places or topics ...
Add a reply
Sign up and join the conversation on Discord