Chunk size error

At a glance

A community member hit a ValueError from the llama_index text splitter (a negative chunk size and chunk overlap) while loading Chinese text with default settings. Another community member suggested two fixes for Chinese text: 1) modify the document contents by replacing each Chinese period (。) with a period followed by a space, or 2) set the text splitter separator to the Chinese period character. The original community member confirmed that the fix worked.

ChatGPT-fan

Hey guys, I got this error:
File "/Users/apple/opt/anaconda3/envs/llama_index/lib/python3.11/site-packages/llama_index/langchain_helpers/text_splitter.py", line 40, in __init__
raise ValueError(
ValueError: Got a larger chunk overlap (-17) than chunk size (-172), should be smaller.
Appreciate any help....

5 comments

Logan M

What language are your documents? What do your settings look like, or is everything default?

ChatGPT-fan

@Logan M I loaded some Chinese text and everything is default..

Logan M

Ah makes sense!

For Chinese text, you have two options that should help:

Modify the document contents before inserting:


for doc in documents:
    # replace each Chinese period with ". " so the text contains whitespace to split on
    doc.text = doc.text.replace("。", ". ")

OR modify the text splitter separator:


from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext

# split on the Chinese period instead of the default separator
splitter = TokenTextSplitter(separator="。")
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

documents = SimpleDirectoryReader("./data").load_data()

# build the index with the custom node parser
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
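
To see why either option changes what the splitter has to work with, here is a small plain-Python sketch. It does not call llama_index at all: `str.split` only stands in for a separator-based splitter, the sample sentence is made up, and the idea that the default separator is a single space is an assumption about the library version used in this thread.

```python
# Illustration only: str.split stands in for a separator-based text splitter.
# The sample text is hypothetical and contains no spaces, like much Chinese prose.
text = "今天天气很好。我们去公园散步。晚上一起吃饭。"

# Splitting on a space (assumed default separator) finds nothing to split on,
# so the whole text stays in one piece.
by_space = text.split(" ")

# Option 1: turn each Chinese period into ". " first, then split on spaces.
normalized = text.replace("。", ". ")
by_space_after_replace = [piece for piece in normalized.split(" ") if piece]

# Option 2: split directly on the Chinese period.
by_chinese_period = [piece for piece in text.split("。") if piece]

print(len(by_space), len(by_space_after_replace), len(by_chinese_period))
```

Both options give the splitter sentence-sized pieces instead of one unbroken string; option 1 changes the stored text, while option 2 leaves the documents untouched.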

ChatGPT-fan

@Logan M thank you so much!! it works!!!

Logan M

amazing!!
