Hi, Jerry. I saw the feature announcement for the new SentenceSplitter. Will it be called automatically when creating a new index? Another question: can it split words in languages that don't use whitespace between words, like Chinese? I'm mainly using 0.4.32, and I saw an error message about an over-length term (longer than max_chunk_limit), so I have to run documents through a Chinese word splitter before creating the index. Because of that, I think the built-in splitter doesn't fit languages without whitespace...
hey! that's a good question, i'm actually not sure. @Hongyi Shi have you tested this on other languages?
I haven't tested it, but it should work better since it can split on non-whitespace.
OTOH, the sentence tokenizer may not recognize non-English punctuation.
Maybe worth adding.
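
For context, one possible workaround while that punctuation support was missing: the 0.5.x-era SentenceSplitter accepts a chunking_tokenizer_fn, so a custom sentence tokenizer can be plugged in. The tokenizer below is a hypothetical sketch, and the parameter usage is an assumption about that era's signature, not a confirmed recipe from this thread.

import re
from gpt_index.langchain_helpers.text_splitter import SentenceSplitter

# Hypothetical sentence tokenizer that also breaks on CJK sentence-ending
# punctuation (。！？); illustrative only, not the library default.
def cjk_sentence_tokenizer(text):
    return [s for s in re.split(r"(?<=[.!?。！？])\s*", text) if s]

splitter = SentenceSplitter(chunking_tokenizer_fn=cjk_sentence_tokenizer)
chunks = splitter.split_text("这是第一句。这是第二句！这是第三句？")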
Thanks for your reply. Would running a specific word splitter (for me it's a Buddhism dictionary add-on) before creating the index benefit semantic understanding or not? If there's no over-length single-term error and no benefit to semantic understanding, I won't do it before creating the index, because it's a bit complicated and it increases the token count.
What matters isn't a word splitter but the tokenizer used during node creation.
I need to do some more testing, but if you can provide some test files, that would be helpful.
Chinese punctuation wasn't supported, but I made a PR to add support for it: https://github.com/jerryjliu/llama_index/pull/1079
A .txt sample file has been uploaded to the PR, thank you very much!
Maybe a silly question: would running a specific word splitter (for me it's a Buddhism dictionary add-on) before creating the index, by adding whitespace to the left and right of some words, benefit the LLM's semantic understanding?
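
As a generic illustration of that pre-segmentation approach, here is a minimal sketch using the jieba segmenter as a stand-in for the Buddhism dictionary add-on; both jieba and the dictionary path are assumptions for illustration, not part of this thread's setup.

import jieba  # generic Chinese word segmenter, standing in for the dict add-on

# Optionally load a domain dictionary (hypothetical path) so specialist
# terms are kept as single words.
# jieba.load_userdict("buddhism_dict.txt")

def presegment(text):
    # Insert whitespace between segmented words so a whitespace-based
    # splitter sees word boundaries; this slightly increases token count.
    return " ".join(jieba.cut(text))

print(presegment("菩提本无树，明镜亦非台。"))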
Thanks for the test file. The PR has been merged in so you should be able to use it in future versions.
To use it you can do something like:

from gpt_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader
from gpt_index.langchain_helpers.text_splitter import SentenceSplitter
from gpt_index.node_parser import SimpleNodeParser  # imports added for completeness

# Parse documents into nodes with the sentence-aware splitter
sentence_splitter = SentenceSplitter()
node_parser = SimpleNodeParser(text_splitter=sentence_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# load_data() returns a list of documents; pass the whole list to from_documents
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
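
Once the index is built, it can be queried directly; the question string below is just a placeholder, and the query call assumes the same 0.5.x-era API as the snippet above.

response = index.query("文中讨论了什么？")
print(response)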
thank you! already noted it down 😎
thanks for the help @Hongyi Shi !!