Hello,
Is there a way to determine when to stop a chunk? I want chunks that stop at the end of a paragraph and do not overlap. I've tried the sentence splitter with \n\n\n but it does not seem to be doing that. The other option would be to have hundreds of separate little txt files, but I'd rather not.
Probably should just write your own chunking logic then and manually create nodes? 🤔
Plain Text
from llama_index.schema import TextNode

# split on triple newlines so each paragraph becomes its own node
paragraphs = text.split("\n\n\n")

nodes = []
for paragraph in paragraphs:
    nodes.append(TextNode(text=paragraph))
something like that lol
assuming the text has paragraphs nicely separated
Would it be compatible with the service context? Like making my own node parser?
You can construct the index with the raw nodes, no need for the node parser
but you could implement a custom node parser if you really wanted
index = VectorStoreIndex(nodes, ...)
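Putting it together, a minimal sketch (assumes an OpenAI API key is configured for the default embedding model, and that `text` holds the document contents):

Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# build nodes manually, one per paragraph
nodes = [TextNode(text=p) for p in text.split("\n\n\n")]

# construct the index directly from the raw nodes -- no node parser involved
index = VectorStoreIndex(nodes)

# query it like any other index
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about X?")
print(response)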
@Logan M wouldn't it perform much worse without the metadata when doing retrieval, or won't it have a significant impact?
The only metadata, really, is whatever was added to the original input document. You can add that yourself if you have some:

Plain Text
TextNode(text=text, metadata={...})
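For instance, to carry over source metadata onto each manually created node (a sketch; the file_name value is just an illustration):

Plain Text
# attach whatever metadata you have to each manually created node
nodes = [
    TextNode(text=paragraph, metadata={"file_name": "report.txt"})
    for paragraph in text.split("\n\n\n")
]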
I see. Why does this code not work to do so?

Plain Text
import tiktoken

from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
I have no idea haha

You can try using the text splitter directly on some sample text to debug

It will split by paragraphs, but then merge back up to match the chunk size

text_splitter.split_text(text)
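For example, a quick sanity check using the text_splitter defined above (a sketch; the sample text is made up):

Plain Text
sample = "First paragraph, short.\n\n\nSecond paragraph, also short.\n\n\nThird one."

chunks = text_splitter.split_text(sample)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)

# With chunk_size=1024, all three paragraphs land in one chunk: the splitter
# breaks on "\n\n\n" first, then merges the pieces back up to the token
# budget, which is why small paragraphs don't stay separate.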
Okay, and with manual creation of nodes, is it possible to persist the index and load it afterwards?
for sure! The nodes are the same, just a different source 🙂
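Something like this should work (a sketch; ./storage is an arbitrary directory):

Plain Text
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage

index = VectorStoreIndex(nodes)

# persist the index to disk
index.storage_context.persist(persist_dir="./storage")

# ...later, load it back
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)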
Thanks I’ll give it a try then 🙂
@Logan M hello, I've forked the repo and modified the sentence splitter class to have the behavior I want, which is to have chunks either of size x (256, 512, ...) or a chunk smaller than x that stops at a new paragraph. Do you think this would be a useful feature for other people? Like having an end_chunk_separator parameter, or would this PR be useless?
Hmm, if it doesn't feel too specific to your data/usecase, then it might be useful!
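As a rough illustration, the behavior described above could look something like this (a hypothetical sketch, not the actual fork; split_on_paragraphs, end_chunk_separator, and max_tokens are made-up names):

Plain Text
# Greedily pack paragraphs into chunks of at most max_tokens tokens,
# always ending a chunk at a paragraph boundary instead of mid-paragraph.
# A single oversized paragraph still becomes its own chunk.
def split_on_paragraphs(text, tokenizer, max_tokens=512, end_chunk_separator="\n\n\n"):
    chunks = []
    current = []
    current_len = 0
    for paragraph in text.split(end_chunk_separator):
        n_tokens = len(tokenizer(paragraph))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(end_chunk_separator.join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append(end_chunk_separator.join(current))
    return chunks

# e.g. split_on_paragraphs(text, tiktoken.encoding_for_model("gpt-3.5-turbo").encode)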
Glad it works though 👌👏