Updated 2 years ago

Chunk size error

Hey guys, I got this error:

File "/Users/apple/opt/anaconda3/envs/llama_index/lib/python3.11/site-packages/llama_index/langchain_helpers/text_splitter.py", line 40, in __init__
raise ValueError(
ValueError: Got a larger chunk overlap (-17) than chunk size (-172), should be smaller.

I'd appreciate any help.
What language are your documents in? What do your settings look like, or is everything default?
@Logan M I loaded some Chinese text, and everything is default.
Ah makes sense!

For Chinese text, you have two options that should help:

  1. Modify the document contents before inserting
Plain Text
for doc in documents:
    # replace each Chinese full stop with a period plus a space, so the text contains whitespace
    doc.text = doc.text.replace("。", ". ")


  2. OR modify the text splitter separator
Plain Text
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext

# split on the Chinese full stop instead of the default separator
splitter = TokenTextSplitter(separator="。")
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

documents = SimpleDirectoryReader("./data").load_data()

index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
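
To see why either option can help, here is a minimal sketch in plain Python. It does not use llama_index at all: `split_on_separator` is a hypothetical helper and the sample sentence is made up. It assumes, as the reply above suggests, that the default splitting relies on whitespace, which Chinese text usually lacks.

```python
# Illustrative only: a naive separator-based splitter,
# NOT the llama_index TokenTextSplitter implementation.
def split_on_separator(text: str, separator: str) -> list[str]:
    """Split text on a separator and drop empty pieces."""
    return [piece for piece in text.split(separator) if piece.strip()]


# Made-up sample text: three sentences, no whitespace anywhere.
chinese_text = "今天天气很好。我们去公园散步。晚上一起吃饭。"

# Splitting on a space finds no split points, so the text stays in one piece.
by_space = split_on_separator(chinese_text, " ")

# Option 2: splitting on the Chinese full stop yields one piece per sentence.
by_full_stop = split_on_separator(chinese_text, "。")

# Option 1: insert whitespace after each full stop, then split on spaces.
normalized = chinese_text.replace("。", ". ")
by_space_after_replace = split_on_separator(normalized, " ")

print(len(by_space), len(by_full_stop), len(by_space_after_replace))
```

With no whitespace to split on, the whole document stays as one large piece. Both options give the splitter sentence-sized pieces to work with.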
@Logan M thank you so much!! It works!!!