
Text splitting

I've tried using chunk_size_limit, but it doesn't seem to be working. It always says the term size is larger than the effective chunk size.
Is your text in English, or something else?

You might have better luck with a different text splitter.

You can use any text splitter from langchain. There's a list of splitters here: https://langchain.readthedocs.io/en/latest/reference/modules/text_splitter.html

Then when you construct your index, you can pass it in with text_splitter=MyTextSplitter() or something like that
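
A minimal sketch of what that could look like, assuming the older gpt_index API discussed in this thread (the chunk_size/chunk_overlap values and the "data" path are illustrative, not recommendations):

```python
# A minimal sketch, assuming the older gpt_index API from this thread;
# chunk_size/chunk_overlap and the "data" path are illustrative.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from gpt_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# RecursiveCharacterTextSplitter retries with progressively smaller
# separators, which can help for text without space-delimited words
# (e.g. Chinese).
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents, text_splitter=splitter)
```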
Thanks, I'm using Chinese characters; it looks like the separator cannot be recognized. I will try a different text splitter.
By the way, I'm confused: we instantiate the splitter here, right? And we don't pass any args to the TokenTextSplitter, so why does the printed log show the chunk_size is 397 instead of 4000?
[Attachment: image.png]
Yea it's a little confusing

There are two levels of splitting -> the initial text_splitter and the prompt_helper

This is because we can't know ahead of time how much room will be in the prompt for text
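
To make that concrete, here's a rough back-of-the-envelope version of what the prompt_helper ends up doing (every number below is an assumed, illustrative value, not the library's actual defaults):

```python
# Rough arithmetic for the effective chunk size; the real logic lives in
# PromptHelper, and every number here is an illustrative assumption.
max_input_size = 4096   # model context window
num_output = 256        # tokens reserved for the model's answer
prompt_tokens = 120     # tokens taken up by the prompt template itself
num_chunks = 10         # chunks that must share the remaining room

available = max_input_size - num_output - prompt_tokens
effective_chunk_size = available // num_chunks
print(effective_chunk_size)  # 372 -- much closer to 397 than to 4000
```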
Ahhh maybe this is why (referring to another question)
Haha I was just going to tag you here
Okay, thanks for your explanation. Then I have another question: why does the parent class BaseGPTIndex calling the method self._build_fallback_text_splitter() invoke the child class's method?
Every index is an extension of BaseGPTIndex, and calls super(), so that line will still be run (unless you provide your own text splitter)

If you are super curious how the code works, I recommend stepping through the code using a debugger like pycharm or PDB πŸ’ͺ
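
That dispatch is just standard Python behavior, by the way; here's a toy illustration (the class names mirror the thread, but the method bodies are made up):

```python
# Toy illustration of why a method called from the parent's __init__
# resolves to the child's override; the bodies here are made up.
class BaseGPTIndex:
    def __init__(self):
        # `self` is actually the child instance, so attribute lookup
        # finds the child's override of this method.
        self._text_splitter = self._build_fallback_text_splitter()

    def _build_fallback_text_splitter(self):
        return "generic splitter"

class GPTTreeIndex(BaseGPTIndex):
    def _build_fallback_text_splitter(self):
        return "tree-specific splitter"

print(GPTTreeIndex()._text_splitter)  # tree-specific splitter
```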
Hi Logan, thanks for your help, I can build the tree index successfully. However, there seems to be a word limit on the output; it looks like the query response is truncated. Any suggestions?
Nice!

Our new FAQ (pinned in the issues channel) has some links regarding the cut off responses

https://docs.google.com/document/d/1bLP7301n4w9_GsukIYvEhZXVAvOMWnrxMy089TYisXU/edit?usp=sharing
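
The usual levers for truncated output are the LLM's max_tokens and a matching num_output; a sketch, assuming the older LLMPredictor/PromptHelper API (the 512 values are illustrative, not recommended settings):

```python
# A minimal sketch, assuming the older gpt_index LLMPredictor/PromptHelper
# API; the 512 values are illustrative, not recommended settings.
from langchain.llms import OpenAI
from gpt_index import GPTTreeIndex, LLMPredictor, PromptHelper, SimpleDirectoryReader

llm_predictor = LLMPredictor(llm=OpenAI(max_tokens=512))  # room for the answer
prompt_helper = PromptHelper(max_input_size=4096, num_output=512, max_chunk_overlap=20)

documents = SimpleDirectoryReader("data").load_data()
index = GPTTreeIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)
```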
Cool! By the way, the number of output tokens is counted toward the total max input tokens, i.e. 4096, right? In other words, increasing the output tokens means decreasing the input tokens?
Yea exactly! It's connected like that
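
In budget terms (4096 is the context window assumed throughout this thread):

```python
# The context window is a shared budget between input and output.
context_window = 4096
num_output = 512                                  # reserved for the response
max_prompt_tokens = context_window - num_output   # 3584 left for the input
```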