Chunk size error

At a glance

A community member encountered an error while using the llama_index library, specifically related to the text splitter. Another community member suggested two options to address the issue for Chinese text: 1) Modify the document contents by replacing Chinese periods with spaces, or 2) Modify the text splitter separator to use the Chinese period character. The original community member confirmed that the second option worked for them.

hey guys, I got this error:

File "/Users/apple/opt/anaconda3/envs/llama_index/lib/python3.11/site-packages/llama_index/langchain_helpers/text_splitter.py", line 40, in __init__
    raise ValueError(
ValueError: Got a larger chunk overlap (-17) than chunk size (-172), should be smaller.

Appreciate any help!
What language are your documents in? What do your settings look like, or is everything default?
@Logan M I loaded some Chinese text and everything is default.
Ah, that makes sense!

For Chinese text, you have two options that should help:

  1. Modify the document contents before building the index
Plain Text
for doc in documents:
    # replace Chinese full stops with an ASCII period and a space,
    # so the default whitespace-based splitter can find break points
    doc.text = doc.text.replace("。", ". ")


  2. OR modify the text splitter separator
Plain Text
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext

# split on the Chinese full stop instead of the default whitespace separator
splitter = TokenTextSplitter(separator="。")
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

documents = SimpleDirectoryReader("./data").load_data()

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
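
For completeness, here is a minimal sketch of querying the resulting index. It assumes the same 0.6-era llama_index API as the snippet above; the question string is just a placeholder:

Plain Text
# ask a question against the index built with the custom separator
query_engine = index.as_query_engine()
response = query_engine.query("What are these documents about?")
print(response)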
@Logan M thank you so much!! It works!!!