Updated 2 months ago

Is sentence splitter still optimal for

Is sentence splitter still optimal for embedding models like bge-m3 that can vectorize a whole article or paragraph?
8 comments
the sentence splitter isn't splitting into single sentences, it's splitting into chunks that respect sentence boundaries
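To make "chunks that respect sentence boundaries" concrete, here is a minimal sketch of the idea, not the library's actual implementation. The function name `chunk_by_sentences`, the regex-based sentence detection, and the character-based `max_chars` budget are all assumptions for illustration; a real splitter may measure size in tokens and handle abbreviations more carefully.

```python
import re


def chunk_by_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper for illustration only. A single sentence longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    # Naive sentence detection: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With a small budget, `chunk_by_sentences("One. Two two. Three three three.", max_chars=15)` groups the first two sentences into one chunk and leaves the third in its own, so no sentence is ever split in the middle.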
Ok but how does it factor in things such as titles, subsections and paragraphs under subsections? <h1> vs <h2> etc
I also want to respect subsection boundaries!
I also need to know the maximum size of an m3 chunk in terms of ASCII characters
section boundaries are harder -- those should probably be split before applying the sentence splitter, using your own algorithm
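One possible shape for that "own algorithm", assuming the source is HTML with `<h1>` to `<h6>` headings: cut the document into sections at each heading first, then hand each section's text to whatever sentence-aware splitter you use. The class `SectionSplitter` and function `split_sections` are hypothetical names for this sketch, which relies only on Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Collect one section per heading tag (hypothetical helper)."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []  # each: {"level": int, "title": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # A new heading starts a new section; level comes from the tag name.
            self.sections.append({"level": int(tag[1]), "title": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not data.strip():
            return
        if not self.sections:
            # Text before the first heading goes into an untitled level-0 section.
            self.sections.append({"level": 0, "title": "", "text": ""})
        key = "title" if self._in_heading else "text"
        section = self.sections[-1]
        section[key] = f"{section[key]} {data.strip()}".strip()


def split_sections(html):
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Keeping the heading level and title alongside each section's text means they can be attached to every chunk as metadata, so a chunk from an `<h2>` subsection never mixes with text from its neighbour.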
Even if I am using a bge-m3 embedding model?
What is the maximum chunk size I can use with bge-m3?
bge-m3 has an 8k context limit, measured in tokens (8192), not ASCII characters
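Because the limit is counted in tokens, there is no fixed character count that is guaranteed to fit; the safe check is to count tokens per chunk. A rough sketch follows. `fits_context` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a stand-in so the example runs without downloads. In practice you would pass a counter built on the model's own tokenizer, which usually produces more tokens than a whitespace split.

```python
from typing import Callable

MAX_TOKENS = 8192  # the "8k" limit from the thread, counted in tokens


def whitespace_token_count(text: str) -> int:
    # Stand-in only: real tokenizers split words into subword pieces,
    # so the true token count is typically higher than this.
    return len(text.split())


def fits_context(
    text: str,
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> bool:
    """Return True if the chunk's token count is within the limit."""
    return count_tokens(text) <= max_tokens
```

Any chunk for which `fits_context` returns `False` should be split further before embedding, since text past the limit is typically truncated rather than embedded.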