Chunk size issue

Hi, any help with the error below?
Plain Text
text_chunks = self.text_splitter.split_text(document.text)

File "/usr/local/lib/python3.8/site-packages/llama_index/langchain_helpers/text_splitter.py", line 118, in split_text
    text_splits = self.split_text_with_overlaps(text, extra_info_str=extra_info_str)
File "/usr/local/lib/python3.8/site-packages/llama_index/langchain_helpers/text_splitter.py", line 157, in split_text_with_overlaps
    raise ValueError(
ValueError: A single term is larger than the allowed chunk size.
Term size: 1094
Chunk size: 512
Effective chunk size: 512


If I have no control over the document, how can I force the chunk size?
14 comments
This is a pretty common issue for documents/text that don't contain a lot of spaces

Your best bet is to use a different text splitter, I think, or to customize the separator token to be something other than a space
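For illustration, a minimal sketch of customizing the separator, assuming the same TokenTextSplitter keyword arguments used later in this thread (separator, chunk_size, chunk_overlap, backup_separators); long_text is a placeholder for your document text:
Plain Text
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter

# Split on newlines first instead of the default single space, and fall
# back to punctuation so long space-free runs still get broken up.
splitter = TokenTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=20,
    backup_separators=[".", ",", ";"],
)
chunks = splitter.split_text(long_text)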
Hi @Logan M, we did use backup separators in our text splitter, which has reduced the number of errors (thanks, it's a great feature to work with)
Plain Text
        # Inside our class __init__; TokenTextSplitter is imported from llama_index
        self.chunk_size = 200
        self.chunk_overlap = 50
        self.backup_separators = [".", ",", "!"]
        # Token-based text splitter with punctuation fallbacks
        self.text_splitter = TokenTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            backup_separators=self.backup_separators,
        )

What we need here is: can TokenTextSplitter be made to forcefully chunk the text purely by token count? What else can we do to improve the chunking, so that we reduce the chunk-overlap and chunk-size errors?
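One way to force a split purely by token count is to slice the token stream directly, sketched here with tiktoken rather than TokenTextSplitter (the encoding name and the hard_split_by_tokens helper are illustrative assumptions, not llama_index API):
Plain Text
import tiktoken

def hard_split_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50):
    """Illustrative helper: slicing the token ids directly guarantees
    no chunk exceeds chunk_size, regardless of separators."""
    enc = tiktoken.get_encoding("gpt2")  # assumption: substitute your model's encoding
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]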

@jerryjliu0 Would it be possible to have an extra parameter on the text splitters that can suppress these errors (in case anyone wants to ignore the chunk errors)?
@Abhishek you could also use the recursive character splitter, which will split on single letters rather than tokens πŸ€”
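A sketch of that approach with LangChain's RecursiveCharacterTextSplitter; the separators list ends in "" so splitting can always fall through to single characters, and note that chunk_size here counts characters, not tokens:
Plain Text
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries each separator in turn; the final "" lets it split between
# single characters, so no unbreakable term can exceed chunk_size.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=512,       # measured in characters by default
    chunk_overlap=50,
)
chunks = splitter.split_text(long_text)  # long_text is a placeholder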
Hey, any solution for this issue? https://github.com/jerryjliu/llama_index/issues/3242
cc: @Abhishek @Logan M @jerryjliu0 @ravitheja
Has anyone faced this error from the OpenAI embedding?
cc: @ravitheja @Logan M
Did you ever try using the recursive text splitter? It will likely solve most token issues for non-English text
This looks like a network issue with sending requests to openai πŸ‘€
@Logan M Yes, we are trying the RecursiveTextSplitter, and we have also posted the code snippet on the GitHub issue mentioned.
Hey @Logan M, Hope you're well. Any updates on this?
Ngl I haven't looked at this at all πŸ˜… apologies for that

Have you tested on the latest version of llama index? Do you have a document/source code you can provide to replicate the issue?
Np @Logan M, we don't have the document, but yes, I have shared the code snippet in the GitHub issue mentioned here: https://github.com/jerryjliu/llama_index/issues/3242
Not going to lie, it's going to be really hard to fix this without a way to reproduce it. Is there any way you can create a document that reproduces it reliably?
Hey, I understand, but the error doesn't pop up when processing a document; it occurred when we received a query
Right, but that means you indexed some document that causes the issue when querying, right? That's the key to replicating and fixing this