Sentence splitter

I'm trying to use the SentenceSplitter but it's removing all the spaces, newlines etc between sentences. Is there some way to keep the text in the input format? Even within the same node, all of the whitespaces between sentences are gone.
Hmm, yea, that's a good point tbh.

As a workaround, you can swap to the older token splitter for now to retain formatting.

I'll have to think/debug a fix for this
which one is that?
Plain Text
from llama_index.text_splitter import TokenTextSplitter
Plain Text
import tiktoken

from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# token-based splitter that keeps separators, so the original formatting survives
text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
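If it helps, here's a minimal usage sketch for that parser (the sample text is just a stand-in):

Plain Text
from llama_index import Document

# parse a document into nodes using the token splitter configured above
document = Document(text="First sentence. Second sentence.\n\nA new paragraph.")
nodes = node_parser.get_nodes_from_documents([document])
print([node.text for node in nodes])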
Could I get an update when the SentenceSplitter is updated?
You could keep the separator characters as part of the splits; that way, when you merge them back together, the whitespace isn't just dropped.
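Something like this, as a rough sketch of the idea (a hypothetical helper, not llama_index's actual code):

Plain Text
import re

def split_keep_separators(text: str) -> list[str]:
    # split after sentence-ending punctuation, *capturing* the whitespace
    # so it is returned instead of being thrown away
    parts = re.split(r"(?<=[.!?])(\s+)", text)
    # glue each captured separator back onto the sentence before it,
    # so that "".join(result) reproduces the input exactly
    result = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        result.append(parts[i] + sep)
    return result

text = "Here is my todo list:\n1. Clean\n2. Shop\n3. Rest"
assert "".join(split_keep_separators(text)) == text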
😦 I would use this, but our repo already basically does this, so the point was to switch to sentence/paragraph-aware chunking
That's not a bad idea. I can likely get that fixed today. Will keep you posted
Hey nice, thanks :D, I appreciate the fast turnaround
If it helps, I think I'm hitting it because the nltk sentence tokenizer throws the whitespace out 😛
Yup, that was my guess as well 😅
@Yuhong Sun correct me if I'm wrong, but the formatting is still there (i.e. newlines)

But the spaces are gone

Plain Text
>>> print(text)
Here is my todo list:
1. Clean
2. Shop
3. Rest

>>> splits = SentenceSplitter(chunk_size=10, chunk_overlap=0).split_text(text)
>>> splits
['Here is my todo list:\n1.', 'Clean\n2.Shop\n3.Rest']


(Also, the splitter doesn't split this very nicely -- but it's just a quick example to illustrate the current behaviour.)
Plain Text
from transformers import AutoTokenizer

from llama_index import Document
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

long_text = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
sent_splitter = SentenceSplitter(tokenizer=tokenizer.tokenize, chunk_size=128, chunk_overlap=100)
# get_default_text_splitter / get_nodes_from_document are assumed to come from
# llama_index's node-parser utilities of this version
default_splitter = get_default_text_splitter()

node_parser = SimpleNodeParser.from_defaults(text_splitter=sent_splitter)

# the issue reproduces with either splitter here
nodes = get_nodes_from_document(
    Document(text=long_text),
    default_splitter,
    include_metadata=False,
    include_prev_next_rel=False,
)
texts = [node.text for node in nodes]
print(texts)
It's dropping both; I tried with a newline after the first sentence and with a space. I'm also seeing it with both the default splitter and the one constructed from the Hugging Face tokenizer.
hmm I'll try this text as well
Wait, I don't see a newline in that long_text snippet 👀
In this case I was showing it failed with space, it also fails with newline
(fails with both)
Are you able to repro? I can probably give a more minimal example if it helps
Yea, was able to repro -- just wrapping up some other stuff first 🙂
great! no rush, just making sure I've given you the info you need
How does this look? A colleague actually already had a fix for this, it seems; it just wasn't merged yet

Plain Text
>>> from llama_index.text_splitter import SentenceSplitter
>>> text = """Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.\n But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."""
>>> splits = SentenceSplitter(chunk_size=25, chunk_overlap=0).split_text(text)
>>> splits
['Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.', 'But I warn you, if you don’t tell me that this means war, if you still try to defend the', 'infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing', 'more to do with you and you are no longer my friend, no longer my ‘faithful slave,', '’ as you call yourself! But how do you do?', 'I see I have frightened you—sit down and tell me all the news.']
>>> splits = SentenceSplitter(chunk_size=50, chunk_overlap=0).split_text(text)
>>> splits
['Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.\n But I warn you, if you don’t tell me that this means war,', 'if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend,', 'no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.']
PR for it, if you wanted to checkout the branch: https://github.com/jerryjliu/llama_index/pull/7590
Looks good, would have to try it out more but this looks correct
Also, to save me from having to read more code, can you explain how chunk_overlap works? Will it not overlap unless the overlap can contain a whole sentence?
Tbh, I'm not 100% sure how the chunk overlap works haha

The sentence splitter works by first splitting into sentences.

And then merging them back into larger chunks of sentences.

I'm guessing the overlap takes the start of the next chunk and adds it to the current? 🤷‍♂️ I usually don't mess with that setting and just leave it at the default lol
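For what it's worth, here's a simplified sketch of how sentence-level overlap could work (my guess at the shape of it, not the actual llama_index implementation): whole sentences from the end of one chunk get carried into the start of the next, so nothing overlaps unless a full sentence fits within the overlap budget.

Plain Text
def merge_with_overlap(sentences, token_len, chunk_size, chunk_overlap):
    """Greedily pack sentences into chunks of at most chunk_size tokens,
    carrying trailing whole sentences (up to chunk_overlap tokens) from
    each finished chunk into the start of the next one."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + token_len(sent) > chunk_size:
            chunks.append("".join(current))
            # carry trailing *whole* sentences forward as the overlap;
            # if no sentence fits in the budget, there is no overlap at all
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                if overlap_len + token_len(prev) > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += token_len(prev)
            current, current_len = overlap, overlap_len
        current.append(sent)
        current_len += token_len(sent)
    if current:
        chunks.append("".join(current))
    return chunks

# toy tokenizer: one token per word
sents = ["One. ", "Two. ", "Three. ", "Four. "]
print(merge_with_overlap(sents, lambda s: len(s.split()), chunk_size=2, chunk_overlap=1))
# ['One. Two. ', 'Two. Three. ', 'Three. Four. '] -- note how a large
# overlap relative to chunk_size makes chunks repeat most of each other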
Also cool, I'll merge that then
Seems a bit weird that one section is completely contained by another. Pretty sure this is because of the overlap, but it's a bit strange regardless.

Plain Text
sent_splitter = SentenceSplitter(tokenizer=tokenizer.tokenize, chunk_size=60, chunk_overlap=20)
[Attachment: image.png]
ha yea, a little weird. Definitely because of small chunks + large overlap

Normally I would set the overlap as a percentage (and also, I wouldn't use a chunk size of 60 tbh unless it's for a super specific use case)
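For example (just an illustration; the 10% ratio is an arbitrary choice):

Plain Text
from llama_index.text_splitter import SentenceSplitter

chunk_size = 1024
chunk_overlap = int(chunk_size * 0.10)  # overlap as a percentage of the chunk size
splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)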