
Updated 2 years ago

Chunk overlap

At a glance

A community member hits a ValueError about chunk overlap while indexing Chinese documents; adjusting the overlap setting does not help. Another community member suggests langchain's RecursiveCharacterTextSplitter, which tends to work better for Chinese text, and shares a code example wiring the splitter into a llama_index ServiceContext. The community members believe this approach should resolve the issue.

Guys, how do I deal with this error?

ValueError: Got a larger chunk overlap (20) than chunk size (-29976), should be smaller.
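For context, this error comes from a sanity check in the text splitter: the overlap must fit inside a chunk. The negative chunk size (-29976) suggests the token budget left over after the prompt has gone negative, so any positive overlap trips the check. A minimal sketch of that validation (hypothetical function name, not llama_index's actual code):

```python
def make_splitter(chunk_size: int, chunk_overlap: int) -> dict:
    # Guard mirroring the library's validation: overlap must be
    # strictly smaller than the chunk size it overlaps into.
    if chunk_overlap > chunk_size:
        raise ValueError(
            f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
            f"({chunk_size}), should be smaller."
        )
    return {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap}

# A negative chunk size means the space left after the prompt is negative,
# so even a small overlap like 20 raises the error seen above.
try:
    make_splitter(chunk_size=-29976, chunk_overlap=20)
except ValueError as err:
    print(err)
```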
Any specific settings you've configured? Are your documents in English or something else?
Documents in Chinese
I tried all values for max_chunk_overlap but they don't work
I tried NOT declaring a prompt evaluator while creating the service context, but then I ran into another input-size-related error
For Chinese, you'll have better luck with a recursive character splitter

One sec, let me type an example
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=200,
)

from llama_index.node_parser import SimpleNodeParser
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    node_parser=SimpleNodeParser(text_splitter=text_splitter)  # add other parser kwargs as needed
)
I think you should have better luck creating an index with that
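For reference, the idea behind recursive splitting: try coarse separators first (paragraphs, then sentence endings), and only fall back to finer splits when a piece is still too long. A simplified pure-Python sketch of the concept (not langchain's actual implementation; it omits overlap and drops separators, and the Chinese full stop "。" stands in for a sentence boundary):

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    # Short enough: keep as a single chunk.
    if len(text) <= chunk_size:
        return [text]
    # No separators left: hard-split into fixed-size slices.
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]
    # Separator not present in the text: try the next, finer one.
    if len(pieces) <= 1:
        return recursive_split(text, rest, chunk_size)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

# Chinese text has no spaces, so splitting on "。" (and finally by raw
# character count) keeps chunks within bounds where a whitespace-based
# splitter would produce one oversized "word".
text = "第一句。第二句。第三句。"
chunks = recursive_split(text, ["\n\n", "。"], chunk_size=4)
```

langchain's RecursiveCharacterTextSplitter does the same kind of cascading fallback, which is why it copes with Chinese where a plain whitespace/token splitter struggles.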