Chunk size issue

Hi, any help with the error below?
Plain Text
text_chunks = self.text_splitter.split_text(document.text)

File "/usr/local/lib/python3.8/site-packages/llama_index/langchain_helpers/text_splitter.py", line 118, in split_text
    text_splits = self.split_text_with_overlaps(text, extra_info_str=extra_info_str)
File "/usr/local/lib/python3.8/site-packages/llama_index/langchain_helpers/text_splitter.py", line 157, in split_text_with_overlaps
    raise ValueError(
ValueError: A single term is larger than the allowed chunk size.
Term size: 1094
Chunk size: 512
Effective chunk size: 512


If I have no control over the document, how can I force the chunk size?
14 comments
This is a pretty common issue for documents/text that don't contain a lot of spaces

Your best bet is to use a different text splitter, I think, or to customize the separator token to be something other than a space
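For illustration, a minimal sketch of customizing the separator, assuming the same TokenTextSplitter keyword arguments used later in this thread (separator, chunk_size, chunk_overlap, backup_separators); long_text is a placeholder for your document text:
Plain Text
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

# Split on newlines first instead of the default single space, and fall
# back to punctuation so long space-free runs still get broken up.
splitter = TokenTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=20,
    backup_separators=[".", ",", ";"],
)
chunks = splitter.split_text(long_text)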
Hi @Logan M, we did use backup separators in our text splitter, which has reduced the number of errors (thanks, it's a great feature to work with)
Plain Text
        # Inside our class __init__; TokenTextSplitter is imported from llama_index
        self.chunk_size = 200
        self.chunk_overlap = 50
        self.backup_separators = [".", ",", "!"]
        # Token-based text splitter with punctuation fallbacks
        self.text_splitter = TokenTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            backup_separators=self.backup_separators,
        )

What we need here is: can TokenTextSplitter be made to forcefully chunk the text purely by token count? What else can we do to improve the chunking, so that we reduce the chunk-overlap and chunk-size errors?
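One way to force a split purely by token count is to slice the token stream directly, sketched here with tiktoken rather than TokenTextSplitter (the encoding name and the hard_split_by_tokens helper are illustrative assumptions, not llama_index API):
Plain Text
import tiktoken

def hard_split_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50):
    """Illustrative helper: slicing the token ids directly guarantees
    no chunk exceeds chunk_size, regardless of separators."""
    enc = tiktoken.get_encoding("gpt2")  # assumption: substitute your model's encoding
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]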

@jerryjliu0 Would it be possible to have an extra parameter on the text splitters that can suppress these errors (in case anyone wants to ignore the chunk errors)?
@Abhishek you could also use the recursive character splitter, which will split on single letters rather than tokens πŸ€”
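A sketch of that approach with LangChain's RecursiveCharacterTextSplitter; the separators list ends in "" so splitting can always fall through to single characters, and note that chunk_size here counts characters, not tokens:
Plain Text
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries each separator in turn; the final "" lets it split between
# single characters, so no unbreakable term can exceed chunk_size.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=512,       # measured in characters by default
    chunk_overlap=50,
)
chunks = splitter.split_text(long_text)  # long_text is a placeholder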
Hey, any solution for this issue? https://github.com/jerryjliu/llama_index/issues/3242
cc: @Abhishek @Logan M @jerryjliu0 @ravitheja
Has anyone faced this error from the OpenAI embedding?
cc: @ravitheja @Logan M
Did you ever try using the recursive text splitter? It will likely solve most token issues for non-English text
This looks like a network issue with sending requests to openai πŸ‘€
@Logan M Yes, we are trying the RecursiveTextSplitter, and we have also posted the code snippet on the GitHub issue mentioned.
Hey @Logan M, Hope you're well. Any updates on this?
Ngl I haven't looked at this at all πŸ˜… apologies for that

Have you tested on the latest version of llama index? Do you have a document/source code you can provide to replicate the issue?
Np @Logan M, we don't have the document, but yes, I have shared the code snippet in the GitHub issue mentioned here: https://github.com/jerryjliu/llama_index/issues/3242
Not going to lie, it's going to be really hard to fix this without a way to reproduce it. Is there any way you can create a document that reproduces it reliably?
Hey, I understand, but the error doesn't pop up when processing a document; it occurred when we received a query
Right, but that means you indexed some document that causes the issue when querying, right? That's the key to replicating and fixing this