
hi, Jerry. I saw the feature announcement for the new sentence text splitter. Will it be called automatically when creating a new index? Another question: can it split words in languages that don't use whitespace between words, like Chinese? I'm mainly using 0.4.32, and I saw error messages about over-length terms (longer than max_chunk_limit), so I have to process documents with a Chinese word splitter before creating the index. That makes me think the built-in splitter doesn't fit languages without whitespace...
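(A minimal sketch of the kind of pre-segmentation described above, using the third-party jieba segmenter as one common choice; jieba and the helper name are illustrative assumptions, not part of llama_index.)

import jieba  # third-party Chinese word segmenter, used here only for illustration

def presegment(text: str) -> str:
    # Insert spaces between Chinese words so a whitespace-based
    # splitter can find token boundaries.
    return " ".join(jieba.cut(text))

print(presegment("佛教辞典中的词条"))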
hey! that's a good question, i'm actually not sure. @Hongyi Shi have you tested this on other languages?
I haven't, but it should work better since it can split on non-whitespace.
OTOH the sentence tokenizer may not recognize non-English punctuation.
Maybe worth adding
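(One possible workaround sketch for non-English punctuation, assuming that era's SentenceSplitter accepts a custom sentence-splitting function via a chunking_tokenizer_fn parameter; check your installed version's signature, and note the regex is only an example.)

import re
from gpt_index.langchain_helpers.text_splitter import SentenceSplitter

def chinese_sentence_tokenizer(text):
    # Split on the Chinese full stop, question mark, and exclamation
    # mark, keeping the punctuation attached to each sentence.
    return re.findall(r"[^。？！]+[。？！]?", text)

sentence_splitter = SentenceSplitter(chunking_tokenizer_fn=chinese_sentence_tokenizer)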
thanks for your reply. Will running a specific word splitter (for me, a Buddhist dictionary add-on) before creating the index improve semantic understanding or not? If there is no over-length single-term error and no benefit to semantic understanding, I won't do it before creating the index, because it's a bit complicated and increases the token count.
What matters is not a word splitter but the tokenizer used during node creation.
I need to do some more testing, but if you can provide some test files, that would be helpful.
Chinese punctuation wasn't supported but I made a PR to add support for it https://github.com/jerryjliu/llama_index/pull/1079
I uploaded a .txt sample file to this PR, thank you very much!
maybe a silly question: would running a specific word splitter (for me, a Buddhist dictionary add-on) before creating the index improve semantic understanding for the LLM or not?
It works by adding whitespace to the left and right of certain words.
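(A minimal sketch of that whitespace-insertion approach, with a hypothetical term list and helper name that are not from any library; a tiktoken count is included to show the token increase mentioned above.)

import tiktoken

# Hypothetical dictionary of domain terms to surround with spaces.
BUDDHIST_TERMS = ["般若", "涅槃", "菩提"]

def add_spaces_around_terms(text: str, terms: list[str]) -> str:
    # Surround each known term with spaces so whitespace-based
    # splitters treat it as a separate unit.
    for term in terms:
        text = text.replace(term, f" {term} ")
    return text

original = "般若波罗蜜多心经讲涅槃之义"
segmented = add_spaces_around_terms(original, BUDDHIST_TERMS)

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(original)), len(enc.encode(segmented)))  # segmented text costs more tokens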
Thanks for the test file. The PR has been merged in so you should be able to use it in future versions.
To use it, you can do something like:
from gpt_index.langchain_helpers.text_splitter import SentenceSplitter
from gpt_index.node_parser import SimpleNodeParser
from gpt_index import GPTSimpleVectorIndex, ServiceContext, SimpleDirectoryReader

# Route node creation through the sentence splitter.
sentence_splitter = SentenceSplitter()
node_parser = SimpleNodeParser(text_splitter=sentence_splitter)
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# load_data() returns a list of documents, so pass the whole list along.
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
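(Note that since the node parser is passed via the ServiceContext, any index built with that service_context will chunk its documents with the sentence splitter; build a separate context if you only want this behavior for some indexes.)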
thank you! already noted it down 😎
thanks for the help @Hongyi Shi !!