So I think I’ve reached the accuracy limitations with the barebones components. I’m thinking about designing my own node parser… since we’re accepting solely XML documents, I have a lot of control.

Basically: 1) any suggestions/tips? 2) What size of node is optimal before the embeddings blend together? 3) Does it matter if some nodes are longer than others (is there some kind of class-imbalance situation)?
hybrid search is a good approach (i.e. combining vector retrieval with keyword search), and then re-ranking the results. I actually just merged a PR that added BM25; there's a notebook here with a custom hybrid retriever:
https://gpt-index.readthedocs.io/en/stable/examples/retrievers/bm25_retriever.html#advanced-hybrid-retriever-re-ranking
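Roughly, the shape of that custom hybrid retriever is something like this (a sketch of the notebook's approach; exact import paths depend on your llama_index version, and "./data" is just a placeholder):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index.retrievers import BaseRetriever, BM25Retriever


class HybridRetriever(BaseRetriever):
    """Union of vector and BM25 results, deduplicated by node id."""

    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine both result lists, keeping each node only once
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes


documents = SimpleDirectoryReader("./data").load_data()  # placeholder data folder
nodes = SimpleNodeParser.from_defaults().get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

retriever = HybridRetriever(vector_retriever, bm25_retriever)
results = retriever.retrieve("your question here")
# optionally re-rank `results` (e.g. with a sentence-transformer reranker) before synthesis
```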

Node size tends to depend on the embedding model being used. OpenAI's embeddings are 1536-dimensional. I find chunks smaller than ~512 tokens may not be great? Your mileage may vary
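If you want to experiment with chunk size, a minimal sketch assuming the ServiceContext-style API (names may differ between llama_index versions, and the overlap value is just a guess):

```python
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# chunk_size is in tokens; 512 is the ballpark mentioned above
service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data folder
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
```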

Doesn't matter if some nodes are larger/smaller imo
Would it be a reasonable conjecture that the closer nodes come to properly representing the data (e.g. a node being a “thought”/concept), the more accurate retrieval gets?
I’m wondering if it would be a good idea to use the LLM to aid the index
Probably too expensive
Yea anything involving extra LLM calls during retrieval will suck
Hmm, not sure what that means tbh
so, embedding produces a vector representation of our chunks of data.

say a paragraph:

Logan went to the store. He bought cheese, apples, bread. For all of this, Logan paid $10.

In this case, what I'm saying is, [Logan went to the store] [He bought cheese, apples, bread] [For all this, Logan paid $10] feels like a better representation than [Logan went to the store. He bought cheese, apples, bread] [For all this, Logan paid $10].

Basically, does retrieval accuracy increase if you can split thoughts into their own unique chunks of data?
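For illustration, a rough sketch of that idea, one node per sentence with the full paragraph stashed in metadata so the context isn't completely lost (the TextNode import path varies by llama_index version, and the "window" metadata key is made up):

```python
import re

from llama_index.schema import TextNode  # import path may differ by version

paragraph = (
    "Logan went to the store. He bought cheese, apples, bread. "
    "For all of this, Logan paid $10."
)

# one node per sentence ("thought"), keeping the whole paragraph as metadata
sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
nodes = [TextNode(text=s, metadata={"window": paragraph}) for s in sentences]
```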
Yea if you could identify the chunks in that manner, it may help 🤔 (tbh in that example, the entire thing would work best as a chunk, but I think I see what you are getting at)
Yeah, in real life I would not go that specific; you would lose the relationship the sentences have to each other
See, what I have are XML documents containing table data
My thinking is, since I have a structured data format, I should be able to very clearly identify chunks, e.g. by making each row of a table its own chunk
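A rough sketch of that row-per-node idea (the <table>/<row>/<cell> tag names are made up, so adjust for the actual XML schema; the TextNode import path depends on the llama_index version):

```python
import xml.etree.ElementTree as ET

from llama_index.schema import TextNode  # import path may differ by version


def xml_rows_to_nodes(xml_path: str) -> list[TextNode]:
    """One node per table row, with the table name kept as metadata."""
    root = ET.parse(xml_path).getroot()
    nodes = []
    for table in root.iter("table"):
        table_name = table.get("name", "table")
        for row in table.iter("row"):
            cells = ", ".join(
                f"{cell.get('name', 'col')}: {(cell.text or '').strip()}"
                for cell in row.iter("cell")
            )
            nodes.append(
                TextNode(text=f"{table_name}: {cells}", metadata={"table": table_name})
            )
    return nodes
```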
I'm really just brainstorming out loud 🙂
Yea that makes sense. For tables and stuff too, if they are more numerical or large, it can be helpful to put them into a dataframe and use a pandas query engine (rough sketch after the links below)

We actually have this concept of recursive retrieval, where you can "retrieve" an index, essentially

https://gpt-index.readthedocs.io/en/stable/examples/query_engine/pdf_tables/recursive_retriever.html

https://gpt-index.readthedocs.io/en/stable/examples/query_engine/recursive_retriever_agents.html
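And a minimal sketch of the pandas query engine idea, assuming a dataframe built from one of the XML tables (the dataframe contents here are made up; the import path may differ by llama_index version):

```python
import pandas as pd

from llama_index.query_engine import PandasQueryEngine  # import path may differ by version

# hypothetical dataframe standing in for one of the XML tables
df = pd.DataFrame({"item": ["cheese", "apples", "bread"], "price": [4.0, 3.5, 2.5]})

query_engine = PandasQueryEngine(df=df, verbose=True)
response = query_engine.query("What was the total price of all items?")
print(response)
```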