Sentence splitter

I'm trying to use the SentenceSplitter but it's removing all the spaces, newlines etc between sentences. Is there some way to keep the text in the input format? Even within the same node, all of the whitespaces between sentences are gone.
Hmm, yea, that's a good point tbh.

As a workaround, you can swap to the older token splitter for now to retain formatting.

I'll have to think/debug a fix for this
which one is that?
Plain Text
from llama_index.text_splitter import TokenTextSplitter
Plain Text
import tiktoken

from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import TokenTextSplitter

# token-based splitter that keeps separators, so the original formatting survives
text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
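If it helps, here's a minimal usage sketch for that parser (the sample text is just a stand-in):

Plain Text
from llama_index import Document

# parse a document into nodes using the token splitter configured above
document = Document(text="First sentence. Second sentence.\n\nA new paragraph.")
nodes = node_parser.get_nodes_from_documents([document])
print([node.text for node in nodes])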
Could I get an update when the SentenceSplitter is updated?
You could keep the separator characters as part of the splits; that way, when you merge them back together, the whitespace isn't just dropped.
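Something like this, as a rough sketch of the idea (a hypothetical helper, not llama_index's actual code):

Plain Text
import re

def split_keep_separators(text: str) -> list[str]:
    # split after sentence-ending punctuation, *capturing* the whitespace
    # so it is returned instead of being thrown away
    parts = re.split(r"(?<=[.!?])(\s+)", text)
    # glue each captured separator back onto the sentence before it,
    # so that "".join(result) reproduces the input exactly
    result = []
    for i in range(0, len(parts), 2):
        sep = parts[i + 1] if i + 1 < len(parts) else ""
        result.append(parts[i] + sep)
    return result

text = "Here is my todo list:\n1. Clean\n2. Shop\n3. Rest"
assert "".join(split_keep_separators(text)) == text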
😦 I would use this, but our repo already basically does this, so the point was to switch to sentence/paragraph-aware chunking
That's not a bad idea. I can likely get that fixed today. Will keep you posted
Hey nice, thanks :D, I appreciate the fast turnaround
If it helps, I think I'm hitting it because the nltk sentence tokenizer throws the whitespace out 😛
Yup, that was my guess as well 😅
@Yuhong Sun correct me if I'm wrong, but the formatting is still there (i.e. newlines)

But the spaces are gone

Plain Text
>>> print(text)
Here is my todo list:
1. Clean
2. Shop
3. Rest

>>> splits = SentenceSplitter(chunk_size=10, chunk_overlap=0).split_text(text)
>>> splits
['Here is my todo list:\n1.', 'Clean\n2.Shop\n3.Rest']


(Also, the splitter doesn't split this very nicely -- but it's just a quick example to illustrate the current behaviour.)
Plain Text
from transformers import AutoTokenizer

from llama_index import Document
from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

long_text = "Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
sent_splitter = SentenceSplitter(tokenizer=tokenizer.tokenize, chunk_size=128, chunk_overlap=100)
# get_default_text_splitter / get_nodes_from_document are assumed to come from
# llama_index's node-parser utilities of this version
default_splitter = get_default_text_splitter()

node_parser = SimpleNodeParser.from_defaults(text_splitter=sent_splitter)

# the issue reproduces with either splitter here
nodes = get_nodes_from_document(
    Document(text=long_text),
    default_splitter,
    include_metadata=False,
    include_prev_next_rel=False,
)
texts = [node.text for node in nodes]
print(texts)
It's dropping both; I tried with a newline after the first sentence and with a space. I'm also seeing it with both the default splitter and the one constructed from the Hugging Face tokenizer.
hmm I'll try this text as well
Wait, I don't see a newline in that long_text snippet 👀
In this case I was showing it failed with space, it also fails with newline
(fails with both)
Are you able to repro? I can probably give a more minimal example if it helps
Yea, was able to repro -- just wrapping up some other stuff first 🙂
great! no rush, just making sure I've given you the info you need
How does this look? A colleague actually already had a fix for this, it seems; it just wasn't merged yet

Plain Text
>>> from llama_index.text_splitter import SentenceSplitter
>>> text = """Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.\n But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news."""
>>> splits = SentenceSplitter(chunk_size=25, chunk_overlap=0).split_text(text)
>>> splits
['Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.', 'But I warn you, if you don’t tell me that this means war, if you still try to defend the', 'infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing', 'more to do with you and you are no longer my friend, no longer my ‘faithful slave,', '’ as you call yourself! But how do you do?', 'I see I have frightened you—sit down and tell me all the news.']
>>> splits = SentenceSplitter(chunk_size=50, chunk_overlap=0).split_text(text)
>>> splits
['Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.\n But I warn you, if you don’t tell me that this means war,', 'if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend,', 'no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.']
PR for it, if you wanted to checkout the branch: https://github.com/jerryjliu/llama_index/pull/7590
Looks good, would have to try it out more but this looks correct
Also, to save me from having to read more code, can you explain how chunk_overlap works? Will it not overlap unless the overlap can contain a whole sentence?
Tbh, I'm not 100% sure how the chunk overlap works haha

The sentence splitter works by first splitting into sentences.

And then merging them back into larger chunks of sentences.

I'm guessing the overlap takes the start of the next chunk and adds it to the current? 🤷‍♂️ I usually don't mess with that setting and just leave it at the default lol
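For what it's worth, here's a simplified sketch of how sentence-level overlap could work (my guess at the shape of it, not the actual llama_index implementation): whole sentences from the end of one chunk get carried into the start of the next, so nothing overlaps unless a full sentence fits within the overlap budget.

Plain Text
def merge_with_overlap(sentences, token_len, chunk_size, chunk_overlap):
    """Greedily pack sentences into chunks of at most chunk_size tokens,
    carrying trailing whole sentences (up to chunk_overlap tokens) from
    each finished chunk into the start of the next one."""
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + token_len(sent) > chunk_size:
            chunks.append("".join(current))
            # carry trailing *whole* sentences forward as the overlap;
            # if no sentence fits in the budget, there is no overlap at all
            overlap, overlap_len = [], 0
            for prev in reversed(current):
                if overlap_len + token_len(prev) > chunk_overlap:
                    break
                overlap.insert(0, prev)
                overlap_len += token_len(prev)
            current, current_len = overlap, overlap_len
        current.append(sent)
        current_len += token_len(sent)
    if current:
        chunks.append("".join(current))
    return chunks

# toy tokenizer: one token per word
sents = ["One. ", "Two. ", "Three. ", "Four. "]
print(merge_with_overlap(sents, lambda s: len(s.split()), chunk_size=2, chunk_overlap=1))
# ['One. Two. ', 'Two. Three. ', 'Three. Four. '] -- note how a large
# overlap relative to chunk_size makes chunks repeat most of each other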
Also cool, I'll merge that then
Seems a bit weird that one section is completely contained by another. Pretty sure this is because of the overlap, but it's a bit strange regardless.

Plain Text
sent_splitter = SentenceSplitter(tokenizer=tokenizer.tokenize, chunk_size=60, chunk_overlap=20)
[Attachment: image.png]
ha yea, a little weird. Definitely because of small chunks + large overlap

Normally I would set the overlap as a percentage (and also, I wouldn't use a chunk size of 60 tbh unless it's for a super specific use case)
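For example (just an illustration; the 10% ratio is an arbitrary choice):

Plain Text
from llama_index.text_splitter import SentenceSplitter

chunk_size = 1024
chunk_overlap = int(chunk_size * 0.10)  # overlap as a percentage of the chunk size
splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)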