
Text splitting

I've tried using chunk_size_limit, but it doesn't seem to be working. It always says the term size is larger than the effective chunk size.
Is your text in English, or something else?

You might have better luck with a different text splitter.

You can use any text splitter from langchain. There's a list of splitters here: https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html

Then when you construct your index, you can pass it in with text_splitter=MyTextSplitter() or something like that
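
A minimal sketch of what that could look like, assuming the older gpt_index API discussed in this thread (the chunk_size/chunk_overlap values and the "data" path are illustrative, not recommendations):

```python
# A minimal sketch, assuming the older gpt_index API from this thread;
# chunk_size/chunk_overlap and the "data" path are illustrative.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# RecursiveCharacterTextSplitter retries with progressively smaller
# separators, which can help for text without space-delimited words
# (e.g. Chinese).
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, text_splitter=splitter)
```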
Thanks, I'm using Chinese characters; it looks like the separator cannot be recognized. I will try a different text splitter.
By the way, I'm confused: we instantiate the splitter here, right? And we don't pass any args to the TokenTextSplitter, so why does the printed log show the chunk_size is 397 instead of 4000?
[Attachment: image.png]
Yea it's a little confusing

There are two levels of splitting -> the initial text_splitter and the prompt_helper

This is because we can't know ahead of time how much room will be in the prompt for text
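
To make that concrete, here's a rough back-of-the-envelope version of what the prompt_helper ends up doing (every number below is an assumed, illustrative value, not the library's actual defaults):

```python
# Rough arithmetic for the effective chunk size; the real logic lives in
# PromptHelper, and every number here is an illustrative assumption.
max_input_size = 4096   # model context window
num_output = 256        # tokens reserved for the model's answer
prompt_tokens = 120     # tokens taken up by the prompt template itself
num_chunks = 10         # chunks that must share the remaining room

available = max_input_size - num_output - prompt_tokens
effective_chunk_size = available // num_chunks
print(effective_chunk_size)  # 372 -- much closer to 397 than to 4000
```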
Ahhh maybe this is why (referring to another question)
Haha I was just going to tag you here
Okay, thanks for your explanation. Then I have another question: why does the parent class BaseGPTIndex calling the method self._build_fallback_text_splitter() invoke the child class's method?
Every index is an extension of BaseGPTIndex, and calls super(), so that line will still be run (unless you provide your own text splitter)

If you are super curious how the code works, I recommend stepping through the code using a debugger like pycharm or PDB πŸ’ͺ
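
That dispatch is just standard Python behavior, by the way; here's a toy illustration (the class names mirror the thread, but the method bodies are made up):

```python
# Toy illustration of why a method called from the parent's __init__
# resolves to the child's override; the bodies here are made up.
class BaseGPTIndex:
    def __init__(self):
        # `self` is actually the child instance, so attribute lookup
        # finds the child's override of this method.
        self._text_splitter = self._build_fallback_text_splitter()

    def _build_fallback_text_splitter(self):
        return "generic splitter"

class GPTTreeIndex(BaseGPTIndex):
    def _build_fallback_text_splitter(self):
        return "tree-specific splitter"

print(GPTTreeIndex()._text_splitter)  # tree-specific splitter
```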
Hi Logan, thanks for your help, I can build the tree index successfully. However, there seems to be a word limit on the output; it looks like the query response is truncated. Any suggestions?
Nice!

Our new FAQ (pinned in the issues channel) has some links regarding the cut off responses

https://docs.google.com/document/d/1bLP7301n4w9_GsukIYvEhZXVAvOMWnrxMy089TYisXU/edit?usp=sharing
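
The usual levers for truncated output are the LLM's max_tokens and a matching num_output; a sketch, assuming the older LLMPredictor/PromptHelper API (the 512 values are illustrative, not recommended settings):

```python
# A minimal sketch, assuming the older gpt_index LLMPredictor/PromptHelper
# API; the 512 values are illustrative, not recommended settings.
from langchain.llms import OpenAI
from gpt_index import GPTTreeIndex, LLMPredictor, PromptHelper, SimpleDirectoryReader

llm_predictor = LLMPredictor(llm=OpenAI(max_tokens=512))  # room for the answer
prompt_helper = PromptHelper(max_input_size=4096, num_output=512, max_chunk_overlap=20)

documents = SimpleDirectoryReader("data").load_data()
index = GPTTreeIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
```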
Cool! By the way, the number of output tokens is counted toward the total max input tokens, i.e. 4096, right? In other words, increasing the output tokens means decreasing the input tokens?
Yea exactly! It's connected like that
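
In budget terms (4096 is the context window assumed throughout this thread):

```python
# The context window is a shared budget between input and output.
context_window = 4096
num_output = 512                                  # reserved for the response
max_prompt_tokens = context_window - num_output   # 3584 left for the input
```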