Tree Indexes

I have a question about the different index types, and which would be appropriate for my use case.

I have a dataset of 100 products. Each product is a food item and is made of ingredients. I am attempting to build an index that would let a LangChain agent answer questions about those items (e.g., which ones contain a particular ingredient, what an ingredient is, etc.).

Each product and ingredient has a description, and that's what I intend to use as the "page content".

Since there is a natural hierarchy to the product/ingredient relationship (products contain ingredients), I was considering using GPTTreeIndex. I've put the full code in this Pastebin (https://pastebin.com/KSBH80QZ) since it's too long to include here, but this is the general framework:

Plain Text
from collections import namedtuple

from llama_index import Document, GPTSimpleVectorIndex, GPTTreeIndex

IngredientEntry = namedtuple("IngredientEntry", ["name", "doc", "index"])
ProductEntry = namedtuple("ProductEntry", ["name", "index"])

ingredient_entries = []
product_entries = []

for ingredient in ingredients:
    # Assuming each ingredient is a (name, description) pair
    name, description = ingredient

    # Build the doc and index for the current ingredient
    doc = Document(text=name + " " + description, doc_id=name)
    index = GPTSimpleVectorIndex([doc])

    # Append the ingredient name, doc, and index to the entries list
    ingredient_entries.append(IngredientEntry(name, doc, index))

for product in products:
    # Get the unique ingredient names for the current product
    product_ingredients = get_product_ingredients(product)

    # Create a GPTTreeIndex for the product from the ingredient indices
    product_index = GPTTreeIndex(
        [e.index for e in ingredient_entries if e.name in product_ingredients]
    )

    # Append the product name and index to the entries list
    product_entries.append(ProductEntry(product, product_index))

# Create a GPTTreeIndex for the portfolio from the product indices
portfolio_index = GPTTreeIndex([e.index for e in product_entries])


I'm not getting the results I expect after running portfolio_index.query, and I'm not sure whether GPTTreeIndex is the right choice here.

Does anyone have advice or guidance on when to use these? I've gone through the docs already. TY
13 comments
Not going to lie, the code is a little confusing 😅

But regardless, I can give some general tips.

The way you are creating documents makes sense (i.e., one document per ingredient/product).

But it looks to me like you are creating vector indexes with a single document each? That won't really work too well.

Also, I would make sure to use the better-supported composable graph class when creating indexes over indexes:
https://gpt-index.readthedocs.io/en/latest/how_to/composability.html

Each sub-index needs some kind of summary text.

As for how to best structure this information, it really depends on what types of queries you are expecting to perform. I would expect a vector index for all ingredients and a vector index for all products, wrapped with a list index, to be a good first attempt.

You could also try making a single tree index with all the documents, but I'm not sure how that would go.
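Here's roughly what that first idea (a vector index per collection, wrapped in a list index) might look like. A minimal, untested sketch: the composability API has shifted between versions, so check the docs linked above for your version, and ingredient_docs / product_docs stand in for your lists of Document objects.

Plain Text
from llama_index import GPTListIndex, GPTSimpleVectorIndex
from llama_index.composability import ComposableGraph

# ingredient_docs / product_docs: assumed lists of Document objects
ingredient_index = GPTSimpleVectorIndex(ingredient_docs)
product_index = GPTSimpleVectorIndex(product_docs)

# Wrap both vector indexes in a list index; each sub-index gets a summary
graph = ComposableGraph.from_indices(
    GPTListIndex,
    [ingredient_index, product_index],
    index_summaries=[
        "Descriptions of every ingredient used across the products.",
        "Descriptions of every product in the portfolio.",
    ],
)

# Older versions may also want query_configs here
response = graph.query("Which products contain cinnamon?")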
Completely fair comment about the code 😂

In the composability link you sent, three documents are read in and then each is indexed separately using GPTTreeIndex.
If they're composed at the end, it seems like one index per doc would work fine with the tree index?

I'll try combining all of these in a single index and report back. Thanks for your help!
One index per doc would only work well if the document is long. But from what I can tell, your documents are probably pretty short? 🤔
@Steam cc @Logan M: composing a graph over your data using the approach Logan linked above could definitely be a good idea. It does have tradeoffs, though, in that queries over graphs will be a bit slower and use more tokens.

In general I wouldn't use the tree index over large documents; the tree index is better just for routing.

In the meantime here's a simpler idea you could try just to start with:
Convert each product into a Document, and add "extra_info" to each Document to attach product info, e.g.
Plain Text
doc_prod1 = Document("<product description>", extra_info={"product_name": "<product name>"})

When this document is split into smaller chunks, the metadata will be injected.

Then put all documents into a GPTSimpleVectorIndex.
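A minimal end-to-end sketch of that idea; the product records here are made up for illustration:

Plain Text
from llama_index import Document, GPTSimpleVectorIndex

# Hypothetical product records -- substitute your real data
products = [
    {"name": "Berry Granola", "description": "Rolled oats, dried blueberries, honey, almonds."},
    {"name": "Trail Mix", "description": "Peanuts, raisins, dark chocolate, cashews."},
]

# One Document per product; the extra_info metadata is injected into
# every chunk the document is split into
docs = [
    Document(
        text=p["description"],
        doc_id=p["name"],
        extra_info={"product_name": p["name"]},
    )
    for p in products
]

index = GPTSimpleVectorIndex(docs)
response = index.query("Which products contain peanuts?")
print(response)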
I've been playing with different types of indexes as well. Here is a scenario: I have a collection of YouTube vids and their transcripts. I created a single vector index from all those txt files. That seems to work OK, but I was wondering if I could make it better. So I tried creating a vector index for each transcript and then wrapping a tree around that. That doesn't work so well, haha. So I just continue to play (and pay) and learn 🙂
One thing you could do, is for each YouTube video/transcript (assuming it's a separate Document object), specify extra_info metadata; this metadata will be injected into each chunk derived from the transcript
Thanks @jerryjliu0, this is what I did the first time around, using something like the code below. Each transcript is a separate file.

Plain Text
# Read documents from disk
documents = SimpleDirectoryReader(directory, file_metadata=filename_to_metadata).load_data()

# Create index
index = GPTSimpleVectorIndex(documents, include_extra_info=True)
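(filename_to_metadata isn't shown above; it's just a callable that SimpleDirectoryReader invokes with each file path to get a metadata dict. One possible shape, with a made-up video_title key:)

Plain Text
import os

def filename_to_metadata(filename: str) -> dict:
    # e.g. "intro_to_vector_indexes.txt" -> {"video_title": "intro to vector indexes"}
    title = os.path.splitext(os.path.basename(filename))[0].replace("_", " ")
    return {"video_title": title}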
And note, I finally got access to the GPT-4 API today. The results from it are much better than gpt-3.5-turbo.
Hi @bSharpCyclist, I've been trying to do this for several days but I can't get it to work... could you share how you are doing it?
@Don Ramoncillo sorry, I haven't been around for a while. I started a new gig recently and have been busy. In my use case I opted for a single vector index; the list index didn't work well.
Don't worry, I've been making some progress on the topic and defined a structure with @Logan M

https://discord.com/channels/1059199217496772688/1098137013011615762