
Updated 3 months ago

I am a bit curious: I watched an implementation of small-to-big retrieval and noticed that it includes the same information at several chunk sizes. During retrieval, wouldn't we be getting the same information back at multiple chunk sizes?
12 comments
no, it gets de-duplicated and/or merged
Right, but wouldn't that limit the number of "different" chunks you are getting, essentially limiting the context? You might end up retrieving more or less the same info (same parent) that is present in the 256, 512 and 1024 chunks.
only the bottom level of chunks is actually retrieved
and then they get merged up if enough children of a parent are retrieved
so for example
  • top k = 10
  • 10 chunks of 256 tokens are retrieved
  • chunk1 and chunk2 have the same parent, they get merged
  • chunk3 and chunk4 have the same parent, they get merged
  • now we have 2 chunks at 512 tokens, 6 chunks at 256
  • the 512 chunks have the same parent, they get merged
  • now we have a single 1024 chunk, and 6 chunks at 256 tokens
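A minimal sketch of the merge-up steps in the example above, not the actual implementation from any library. The names `merge_up`, `parent_of`, `children_of` and `min_ratio` are hypothetical, and requiring all children of a parent to be retrieved (`min_ratio=1.0`) is an assumption; the thread only says "enough".

```python
from collections import defaultdict

def merge_up(retrieved, parent_of, children_of, min_ratio=1.0):
    """Replace retrieved sibling chunks with their parent, level by level.

    min_ratio is an assumed threshold: the fraction of a parent's children
    that must be present before they are swapped for the parent.
    """
    current = set(retrieved)
    changed = True
    while changed:
        changed = False
        # Group the chunks we currently hold by their parent.
        by_parent = defaultdict(set)
        for node in current:
            parent = parent_of.get(node)
            if parent is not None:
                by_parent[parent].add(node)
        # Swap in any parent that has enough of its children present.
        for parent, hits in by_parent.items():
            if len(hits) / len(children_of[parent]) >= min_ratio:
                current -= hits
                current.add(parent)
                changed = True
    return current

# Toy hierarchy: "A" is a 1024 chunk, "A0"/"A1" are 512, the "...a"/"...b" leaves are 256.
children_of = {
    "A": ["A0", "A1"],
    "A0": ["A0a", "A0b"],
    "A1": ["A1a", "A1b"],
}
for name in "BCDEFG":
    children_of[f"{name}0"] = [f"{name}0a", f"{name}0b"]
parent_of = {child: parent for parent, kids in children_of.items() for child in kids}

# top k = 10: four leaves under "A" plus six leaves whose siblings were not retrieved.
retrieved = ["A0a", "A0b", "A1a", "A1b"] + [f"{name}0a" for name in "BCDEFG"]
merged = merge_up(retrieved, parent_of, children_of)
print(sorted(merged))
```

With this toy data the four leaves under "A" collapse into the single 1024 chunk, and the other six 256 chunks stay as they are.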
Interesting, so although we might be creating chunks of smaller sizes, e.g. 128, 256 and 512 (given a parent of 1024), only the smallest ones will be used for retrieval, in our case the 128. I suppose the intermediate ones are used for the merging process you mentioned? Really appreciate the insights you provided
Yea exactly, you got it 👍
Thank you. I do have a follow-up question, though it is more a matter of opinion. Would you pair this retrieval approach with a reranker? If yes, I suppose you would do it after the chunks were merged
Yea reranking makes a lot of sense, especially to filter stuff out because you have to set the initial top-k pretty high for merging to happen nicely
So high top-k for retrieving small chunks -> merging -> reranking with a small top-k. Any rough advice on the initial top-k size?
🤷‍♂️ hard to say. Probably 15 or 20 is a good start?
Just wanted to get a rough idea on the size, that would do. Again thank you so much for your help
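A rough sketch of the flow discussed above (high top-k over small chunks -> merging -> reranking down to a small top-n). Everything here is hypothetical: `word_overlap` is a toy stand-in for both the vector search score and a real reranker, `merge` is only a placeholder for the merge-up step, and the default of 20 just follows the "15 or 20" suggestion.

```python
def word_overlap(query, text):
    # Toy relevance score: count of shared lowercase words.
    # A real setup would use embeddings for retrieval and a separate reranker model.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve(query, corpus, top_k):
    # Stand-in for similarity search over the smallest chunks.
    return sorted(corpus, key=lambda c: word_overlap(query, c), reverse=True)[:top_k]

def merge(chunks):
    # Placeholder for the merge-up step; here it only drops exact duplicates,
    # keeping the first occurrence of each chunk.
    return list(dict.fromkeys(chunks))

def retrieve_merge_rerank(query, corpus, initial_top_k=20, final_top_n=3):
    candidates = retrieve(query, corpus, initial_top_k)   # high top-k so merging can happen
    merged = merge(candidates)                            # merge before reranking
    reranked = sorted(merged, key=lambda c: word_overlap(query, c), reverse=True)
    return reranked[:final_top_n]                         # small top-n after reranking

corpus = [
    "small to big retrieval merges child chunks",
    "rerankers filter merged chunks",
    "unrelated note about cooking",
    "small to big retrieval merges child chunks",
]
top = retrieve_merge_rerank(
    "how does small to big retrieval merge chunks", corpus, final_top_n=2
)
print(top)
```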