Ah makes sense!
For Chinese text, you have two options that should help:
- Modify the documents' contents before inserting
for doc in documents:
    # replace every Chinese full stop with ". " so there is whitespace to split on
    doc.text = doc.text.replace("。", ". ")
- OR modify the text splitter's separator
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext
# split on the Chinese full stop instead of the default separator
splitter = TokenTextSplitter(separator="。")
node_parser = SimpleNodeParser(text_splitter=splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)
documents = SimpleDirectoryReader("./data").load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
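
To see why either option helps, here is a rough sketch using plain `str.split` as a stand-in for the splitter (it is not llama_index's actual implementation, and it assumes the default separator is a single space). The sample text is made up for illustration.

```python
# Hypothetical sample: Chinese sentences end with "。" and contain no spaces.
text = "今天天气很好。我们去公园散步。晚上一起吃饭。"

def split_on(s, separator):
    """Split s on separator and drop empty pieces."""
    return [piece for piece in s.split(separator) if piece]

# Splitting on a space finds no split points, so the text stays in one piece.
pieces_default = split_on(text, " ")

# Option 1: rewrite the full stop so that a space follows every sentence.
pieces_replaced = split_on(text.replace("。", ". "), " ")

# Option 2: keep the text as-is and split on the full stop directly.
pieces_custom = split_on(text, "。")

print(len(pieces_default), len(pieces_replaced), len(pieces_custom))
```

With no split points, the splitter can't break the text at sentence boundaries when building chunks, which is what both options above work around.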