So I think I’ve reached the accuracy limitations with the barebones components. I’m thinking about designing my own node parser… since we’re accepting solely XML documents, I have a lot of control.

Basically: 1) any suggestions/tips? 2) What size of node is optimal before the embeddings blend together? 3) Does it matter if some nodes are longer than others (is there some kind of class-imbalance situation)?
hybrid search is a good approach (i.e. combining vector retrieval with keyword search), and then re-ranking the results. I actually just merged a PR that added BM25; there's a notebook here with a custom hybrid retriever:
https://gpt-index.readthedocs.io/en/stable/examples/retrievers/bm25_retriever.html#advanced-hybrid-retriever-re-ranking
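Roughly, the shape of that custom hybrid retriever is something like this (a sketch of the notebook's approach; exact import paths depend on your llama_index version, and "./data" is just a placeholder):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.retrievers import BaseRetriever, BM25Retriever


class HybridRetriever(BaseRetriever):
    """Union of vector and BM25 results, deduplicated by node id."""

    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine both result lists, keeping each node only once
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes


documents = SimpleDirectoryReader("./data").load_data()  # placeholder data folder
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

retriever = HybridRetriever(vector_retriever, bm25_retriever)
results = retriever.retrieve("your question here")
# optionally re-rank `results` (e.g. with a sentence-transformer reranker) before synthesis
```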

Node size tends to depend on the embedding model being used. OpenAI's embeddings are 1536-dimensional. I find chunks smaller than ~512 tokens may not be great? Your mileage may vary
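If you want to experiment with chunk size, a minimal sketch assuming the ServiceContext-style API (names may differ between llama_index versions, and the overlap value is just a guess):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# chunk_size is in tokens; 512 is the ballpark mentioned above
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data folder
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```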

Doesn't matter if some nodes are larger/smaller imo
Would it be a reasonable conjecture that the closer nodes come to properly representing the data (e.g. a node being a “thought”/concept), the more accurate retrieval gets?
I’m wondering if it would be a good idea to use the LLM to aid the index
Probably too expensive
Yea anything involving extra LLM calls during retrieval will suck
Hmm, not sure what that means tbh
so, embedding produces a vector representation of our chunks of data.

say a paragraph:

Logan went to the store. He bought cheese, apples, bread. For all of this, Logan paid $10.

In this case, what I'm saying is, [Logan went to the store] [He bought cheese, apples, bread] [For all this, Logan paid $10] feels like a better representation than [Logan went to the store. He bought cheese, apples, bread] [For all this, Logan paid $10].

Basically, does retrieval accuracy increase if you can split thoughts into their own unique chunks of data?
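For illustration, a rough sketch of that idea, one node per sentence with the full paragraph stashed in metadata so the context isn't completely lost (the TextNode import path varies by llama_index version, and the "window" metadata key is made up):

```python
import re

from llama_index.schema import TextNode  # import path may differ by version

paragraph = (
    "Logan went to the store. He bought cheese, apples, bread. "
    "For all of this, Logan paid $10."
)

# one node per sentence ("thought"), keeping the whole paragraph as metadata
sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
nodes = [TextNode(text=s, metadata={"window": paragraph}) for s in sentences]
```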
Yea if you could identify the chunks in that manner, it may help 🤔 (tbh in that example, the entire thing would work best as a chunk, but I think I see what you are getting at)
Yeah, in real life I would not go that specific; you would lose the relationship the sentences have to each other
See, what I have are XML documents containing table data
My thinking is, since I have a structured data format, I should be able to very clearly identify chunks, e.g. by making each row of a table its own chunk
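A rough sketch of that row-per-node idea (the <table>/<row>/<cell> tag names are made up, so adjust for the actual XML schema; the TextNode import path depends on the llama_index version):

```python
import xml.etree.ElementTree as ET

from llama_index.schema import TextNode  # import path may differ by version


def xml_rows_to_nodes(xml_path: str) -> list[TextNode]:
    """One node per table row, with the table name kept as metadata."""
    root = ET.parse(xml_path).getroot()
    nodes = []
    for table in root.iter("table"):
        table_name = table.get("name", "table")
        for row in table.iter("row"):
            cells = ", ".join(
                f"{cell.get('name', 'col')}: {(cell.text or '').strip()}"
                for cell in row.iter("cell")
            )
            nodes.append(
                TextNode(text=f"{table_name}: {cells}", metadata={"table": table_name})
            )
    return nodes
```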
I'm really just brainstorming out loud 🙂
Yea that makes sense. For tables and stuff too, if they are more numerical or large, it can be helpful to put them into a dataframe and use a pandas query engine (rough sketch after the links below)

We actually have this concept of recursive retrieval, where you can "retrieve" an index, essentially

https://gpt-index.readthedocs.io/en/stable/examples/query_engine/pdf_tables/recursive_retriever.html

https://gpt-index.readthedocs.io/en/stable/examples/query_engine/recursive_retriever_agents.html
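And a minimal sketch of the pandas query engine idea, assuming a dataframe built from one of the XML tables (the dataframe contents here are made up; the import path may differ by llama_index version):

```python
import pandas as pd

from llama_index.query_engine import PandasQueryEngine  # import path may differ by version

# hypothetical dataframe standing in for one of the XML tables
df = pd.DataFrame({"item": ["cheese", "apples", "bread"], "price": [4.0, 3.5, 2.5]})

query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query("What was the total price of all items?")
print(response)
```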