Hello,
Is there a way to determine when to stop a chunk? I want chunks that stop at the end of a paragraph and do not overlap. I've tried the sentence splitter with \n\n\n but it does not seem to be doing that. The other option would be to have hundreds of separate little txt files, but I'd rather not.
Probably should just write your own chunking logic then and manually create nodes? 🤔
Plain Text
from llama_index.schema import TextNode

# split on triple newlines so each paragraph becomes its own node
paragraphs = text.split("\n\n\n")

nodes = []
for paragraph in paragraphs:
    nodes.append(TextNode(text=paragraph))
something like that lol
assuming the text has paragraphs nicely separated
Would it be compatible with the service context? Like making my own node parser?
You can construct the index with the raw nodes, no need for the node parser
but you could implement a custom node parser if you really wanted
index = VectorStoreIndex(nodes, ...)
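Putting it together, a minimal sketch (assumes an OpenAI API key is configured for the default embedding model, and that `text` holds the document contents):

Plain Text
from llama_index import VectorStoreIndex
from llama_index.schema import TextNode

# build nodes manually, one per paragraph
nodes = [TextNode(text=p) for p in text.split("\n\n\n")]

# construct the index directly from the raw nodes -- no node parser involved
index = VectorStoreIndex(nodes)

# query it like any other index
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about X?")
print(response)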
@Logan M wouldn't it perform much worse without the metadata when doing retrieval, or won't it have a significant impact?
The only metadata, really, is whatever was added to the original input document. You can add that yourself if you have some:

Plain Text
TextNode(text=text, metadata={...})
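For instance, to carry over source metadata onto each manually created node (a sketch; the file_name value is just an illustration):

Plain Text
# attach whatever metadata you have to each manually created node
nodes = [
    TextNode(text=paragraph, metadata={"file_name": "report.txt"})
    for paragraph in text.split("\n\n\n")
]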
I see. Why does this code not work to do so?

Plain Text
import tiktoken

from llama_index.node_parser import SimpleNodeParser
from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
I have no idea haha

You can try using the text splitter directly on some sample text to debug

It will split by paragraphs, but then merge back up to match the chunk size

text_splitter.split_text(text)
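For example, a quick sanity check using the text_splitter defined above (a sketch; the sample text is made up):

Plain Text
sample = "First paragraph, short.\n\n\nSecond paragraph, also short.\n\n\nThird one."

chunks = text_splitter.split_text(sample)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)

# With chunk_size=1024, all three paragraphs land in one chunk: the splitter
# breaks on "\n\n\n" first, then merges the pieces back up to the token
# budget, which is why small paragraphs don't stay separate.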
Okay, and with manual creation of nodes, is it possible to persist the index and load it afterwards?
for sure! The nodes are the same, just a different source 🙂
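Something like this should work (a sketch; ./storage is an arbitrary directory):

Plain Text
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage

index = VectorStoreIndex(nodes)

# persist the index to disk
index.storage_context.persist(persist_dir="./storage")

# ...later, load it back
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)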
Thanks I’ll give it a try then 🙂
@Logan M hello, I've forked the repo and modified the sentence splitter class to have the behavior I want, which is to have chunks either of size x (256, 512, ...) or a chunk smaller than x that stops at a new paragraph. Do you think this would be a useful feature for other people? Like having an end_chunk_separator parameter, or would this PR be useless?
Hmm, if it doesn't feel too specific to your data/usecase, then it might be useful!
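As a rough illustration, the behavior described above could look something like this (a hypothetical sketch, not the actual fork; split_on_paragraphs, end_chunk_separator, and max_tokens are made-up names):

Plain Text
# Greedily pack paragraphs into chunks of at most max_tokens tokens,
# always ending a chunk at a paragraph boundary instead of mid-paragraph.
# A single oversized paragraph still becomes its own chunk.
def split_on_paragraphs(text, tokenizer, max_tokens=512, end_chunk_separator="\n\n\n"):
    chunks = []
    current = []
    current_len = 0
    for paragraph in text.split(end_chunk_separator):
        n_tokens = len(tokenizer(paragraph))
        if current and current_len + n_tokens > max_tokens:
            chunks.append(end_chunk_separator.join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append(end_chunk_separator.join(current))
    return chunks

# e.g. split_on_paragraphs(text, tiktoken.encoding_for_model("gpt-3.5-turbo").encode)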
Glad it works though 👌👏