
Updated 8 months ago

Sort of a general RAG question (using

Sort of a general RAG question (using llama-index) to anyone. If you have some sample text data:

Plain Text
I have a corpus of documents that I have broken down into chunks. Each chunk is about 20 sentences long. I also chunked these documents with a sliding window to maintain context. I used the openai embeddings model to create a vector for each chunk of text. Currently, when the user submits a query, the app will embed this query, perform a semantic search against the vector database, then provide the gpt model the top 10 chunks of text with the user query, gpt then provides an answer to the query.


You can use embeddings w/ llamaindex's tools to semantically split this into different chunks.

Let's say that returns you 3 chunks:

Plain Text
I have a corpus of documents that I have broken down into chunks. Each chunk is about 20 sentences long. I also chunked these documents with a sliding window to maintain context. 

I used the openai embeddings model to create a vector for each chunk of text. 

Currently, when the user submits a query, the app will embed this query, perform a semantic search against the vector database, then provide the gpt model the top 10 chunks of text with the user query, gpt then provides an answer to the query.


What I'm wondering is basically, what does the tradeoff look like for these smaller semantic chunks as opposed to a large chunk.

In my head, if you do that initial paragraph as 1 vector, vs. 3 vectors (1 of each chunk), your retrieval ability should be higher with the second approach. Each vector, to me, will be less 'diluted' in terms of info. But what happens when information in 1 semantic unit is dependent on the previous. For example, if chunk 2 only makes sense after reading chunk 1. Are you SOL?

I guess I can't seem to figure out (either mathematically or logically) what that tradeoff looks like in terms of IR accuracy.
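A toy way to poke at the 'dilution' intuition: the sketch below uses a bag-of-words count vector as a stand-in for a real embedding model (that substitution is an assumption, and real embeddings will behave differently in detail). It scores a made-up query against the three small chunks and against the single large chunk from the sample text above.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The three chunks from the sample text in the question.
chunks = [
    "I have a corpus of documents that I have broken down into chunks. "
    "Each chunk is about 20 sentences long. I also chunked these documents "
    "with a sliding window to maintain context.",
    "I used the openai embeddings model to create a vector for each chunk of text.",
    "Currently, when the user submits a query, the app will embed this query, "
    "perform a semantic search against the vector database, then provide the "
    "gpt model the top 10 chunks of text with the user query, gpt then "
    "provides an answer to the query.",
]
big_chunk = " ".join(chunks)

# Hypothetical query, chosen to target the second chunk.
query = "which embeddings model was used to create each vector"

small_scores = [cosine(embed(query), embed(c)) for c in chunks]
big_score = cosine(embed(query), embed(big_chunk))

print("per-chunk scores:", [round(s, 3) for s in small_scores])
print("single big chunk:", round(big_score, 3))
```

In this toy setup the focused chunk scores higher than the merged one, which matches the intuition. It says nothing about the dependent-chunk case, where the small chunk is retrieved but is not understandable on its own.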
34 comments
I think this is an age-old question 😅 Chunking is the bane of semantic retrieval

One idea could be, why not both? Embed and retrieve multiple versions of the same text (i.e. semantically grouped, paragraph grouped), and then do some sort of.... fuzzy deduplication 🤔 Or let the LLM figure it out in a postprocessing step?
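A minimal sketch of what that fuzzy deduplication step could look like, assuming retrieval returns `(text, score)` pairs from several chunking strategies. The `containment` measure, the `0.8` threshold, and the example scores are all made up for illustration.

```python
import re

def tokens(text):
    """Lowercased word set used for overlap comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def containment(a, b):
    """Fraction of the smaller token set that also appears in the other."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def fuzzy_dedupe(hits, threshold=0.8):
    """Keep the best-scoring hit among hits that largely overlap each other."""
    kept = []
    for text, score in sorted(hits, key=lambda h: h[1], reverse=True):
        if all(containment(text, kept_text) < threshold for kept_text, _ in kept):
            kept.append((text, score))
    return kept

sentence = "I used the openai embeddings model to create a vector for each chunk of text."
paragraph = (
    "Each chunk is about 20 sentences long. " + sentence +
    " Currently, when the user submits a query, the app will embed this query."
)

hits = [
    (sentence, 0.82),   # from the semantically grouped index
    (paragraph, 0.61),  # from the paragraph grouped index, contains the sentence
    ("Chunk by title works well for structured docs.", 0.40),
]

deduped = fuzzy_dedupe(hits)
for text, score in deduped:
    print(score, text[:60])
```

This version keeps whichever overlapping variant scored highest. Keeping the longer variant instead (for more context) is an equally reasonable choice and only changes the sort key.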
The question definitely isn't new, but with regards to semantic chunking specifically, technically shouldn't they all be 1 chunk if they're semantically relevant :^)
you know what i mean?
Embed and retrieve multiple versions of the same text (i.e. semantically grouped, paragraph grouped),
this could work 🤔
i miss when ai was just a buzzword i heard about 😭
technically they should be I guess haha -- is there a specific case where they aren't for you? Or just something you are thinking about?

I guess there might be a case where they arent really semantically related, but more... linguistically related? Like co-reference resolution, parts-of-speech, all that jazz. I'm not sure how well embeddings capture that or not
oh boy co-reference resolution was not even something i was thinking about lol
i would guess that increasing the sentence size 'buffer' (i.e. > 1) reduces that. but then you are increasing the odds of semantically irrelevant info accidentally being in a chunk, and then obv that will likely decrease the semantic similarity

e.g.

AAAB split into

AA
AB
would probably produce chunks of [AA, AB] since the co-reference would have to be over multiple sentences to be a problem

whereas split into
A
A
B
B
would probably produce [AAA, B] but run into those issues in those cases you mentioned
so, trade-off ig
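The AAAB example can be played out in a toy model. This is a sketch under loud assumptions: each sentence is just a topic label ("A" or "B") with a one-hot vector, sentences are grouped into non-overlapping windows of `buffer_size`, and adjacent groups are merged when their cosine similarity is at least a made-up threshold of `0.8`. It is not how llama-index's semantic splitter is implemented.

```python
import math

# Hypothetical one-hot "embeddings" for two topics.
TOPIC_VECTORS = {"A": (1.0, 0.0), "B": (0.0, 1.0)}

def embed_group(group):
    """Sum the topic vectors of every sentence in the group."""
    x = sum(TOPIC_VECTORS[s][0] for s in group)
    y = sum(TOPIC_VECTORS[s][1] for s in group)
    return (x, y)

def cosine(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, buffer_size, threshold=0.8):
    """Group sentences by buffer_size, then merge similar adjacent groups."""
    groups = [
        sentences[i:i + buffer_size]
        for i in range(0, len(sentences), buffer_size)
    ]
    if not groups:
        return []
    chunks = [groups[0]]
    for prev, cur in zip(groups, groups[1:]):
        if cosine(embed_group(prev), embed_group(cur)) >= threshold:
            chunks[-1] = chunks[-1] + cur  # similar enough: extend current chunk
        else:
            chunks.append(cur)             # similarity dropped: start a new chunk
    return chunks

print(semantic_chunks("AAAB", buffer_size=2))  # ['AA', 'AB']
print(semantic_chunks("AAAB", buffer_size=1))  # ['AAA', 'B']
```

With a buffer of 2, the B sentence rides along with an A sentence (so a cross-sentence reference survives, at the cost of a mixed chunk). With a buffer of 1, the chunks are topically clean but B is cut off from whatever it referred to.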
yea true true.

I think a postprocessing step that uses an LLM to decide if it still needs the prev or next node in the sequence (after retrieval) makes sense (and is something llama-index has). But that will incur extra runtime due to the extra LLM call
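Roughly, that postprocessing step could be sketched like this. It is plain Python, not llama-index's implementation: `needs_context` is a hypothetical callable standing in for the LLM decision, and the pronoun heuristic below is only a cheap placeholder so the example runs without an API call.

```python
def expand_with_neighbors(chunks, retrieved_ids, needs_context):
    """Add the previous or next chunk when a retrieved chunk asks for it.

    needs_context(chunk) must return "previous", "next", or "none".
    """
    selected = set(retrieved_ids)
    for i in retrieved_ids:
        direction = needs_context(chunks[i])
        if direction == "previous" and i > 0:
            selected.add(i - 1)
        elif direction == "next" and i < len(chunks) - 1:
            selected.add(i + 1)
    # Return chunks in document order so the LLM reads them in sequence.
    return [chunks[i] for i in sorted(selected)]

def starts_with_dangling_reference(chunk):
    """Crude placeholder for the LLM call: flag chunks that open with a pronoun."""
    words = chunk.split()
    if not words:
        return "none"
    first_word = words[0].lower().strip(",.")
    pronouns = {"it", "this", "that", "they", "these", "those"}
    return "previous" if first_word in pronouns else "none"

# Made-up chunks for illustration.
chunks = [
    "The splitter groups sentences into chunks.",
    "It then stores one vector per chunk.",
    "Chunking by title is a different approach.",
]

context = expand_with_neighbors(chunks, [1], starts_with_dangling_reference)
print(context)
```

Swapping the heuristic for a real LLM call is where the extra runtime comes from: one additional call per retrieved chunk, unless the decisions are batched.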
it always comes back to LLMs 😭
but i guess if it didn't, LLMs wouldn't be the major breakthrough they were
they are definitely a crutch for hard problems lol
the alternative is training some smaller model to make that decision, but who wants to do that 😆
no exactly 😂
I realized a little while ago there is no clean RAG solution that fits all
I know other people realized that loooong before me
God-Mode RAG cant come fast enough
my fear is it'll come down to massive human efforts to cover tons of edge cases and then get locked behind cloud providers
not as long as llamaindex exists ❤️
we are actively thinking about this too -- some sort of reference architecture
I wish I knew of more architectures to reference for production-ready applications
Would be super helpful in making these kindsa decisions
A lot of the time it just comes down to 'well I think this will logically work better' but testing each individual decision is a huuuuuuuge amount of work
For the important ones I make a dataset, run the tests & have my answer
But obv there's so many cases haha 😄
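For the "make a dataset, run the tests" part, a small harness along these lines is usually enough to compare two chunking strategies. This is a sketch: hit rate and MRR are standard retrieval metrics, but the query IDs, chunk IDs, and rankings below are invented, and it assumes one expected chunk per query.

```python
def hit_rate_and_mrr(results, expected, k=10):
    """Compute hit rate and mean reciprocal rank over a labeled query set.

    results:  {query: [chunk_id, ...]} ranked best first
    expected: {query: chunk_id} the chunk that should be retrieved
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, ranked_ids in results.items():
        top_k = ranked_ids[:k]
        target = expected[query]
        if target in top_k:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top_k.index(target) + 1)
    n = len(results)
    return (hits / n, reciprocal_rank_sum / n) if n else (0.0, 0.0)

# Invented labels and rankings for two hypothetical chunking strategies.
expected = {"q1": "c2", "q2": "c5"}
small_chunk_results = {"q1": ["c2", "c1"], "q2": ["c4", "c5"]}
large_chunk_results = {"q1": ["c1", "c2"], "q2": ["c9", "c8"]}

small_metrics = hit_rate_and_mrr(small_chunk_results, expected, k=2)
large_metrics = hit_rate_and_mrr(large_chunk_results, expected, k=2)
print("small chunks (hit rate, MRR):", small_metrics)
print("large chunks (hit rate, MRR):", large_metrics)
```

Chunk IDs differ between strategies in practice, so the labels usually need to map each query to a source passage and then to whichever chunk contains it under each strategy.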
for some cases, you might wanna also check out unstructured's chunk by title. It seems to make sense more than splitting by sentence + any overlap for many cases.
But that's a decision you gotta make on a case by case basis. no general rules.
i chunk by title myself in the parser
but with further subdivisions
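A rough sketch of chunking by title with further subdivisions, in plain Python rather than unstructured's own API. It assumes titles are lines starting with `# ` (an assumption about the input format) and splits any section whose text would exceed a hypothetical `max_chars` budget, carrying the title along as metadata on every sub-chunk.

```python
def chunk_by_title(lines, max_chars=200):
    """Split lines into titled sections, then subdivide long sections."""
    sections = []
    title, body = None, []
    for line in lines:
        if line.startswith("# "):
            # A new title closes the section collected so far.
            if title is not None or body:
                sections.append((title, body))
            title, body = line[2:].strip(), []
        elif line.strip():
            body.append(line.strip())
    if title is not None or body:
        sections.append((title, body))

    chunks = []
    for title, body in sections:
        current = ""
        for paragraph in body:
            if current and len(current) + len(paragraph) + 1 > max_chars:
                chunks.append({"title": title, "text": current})
                current = paragraph
            else:
                current = f"{current} {paragraph}".strip()
        if current:
            chunks.append({"title": title, "text": current})
    return chunks

# Made-up document for illustration.
doc = [
    "# Setup",
    "Install the package.",
    "Configure the key.",
    "# Usage",
    "Run the query.",
]

title_chunks = chunk_by_title(doc, max_chars=30)
for chunk in title_chunks:
    print(chunk["title"], "|", chunk["text"])
```

Keeping the title on each sub-chunk is one way to soften the "chunk 2 only makes sense after chunk 1" problem, since every piece still carries its section context.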