
Token

At a glance

The community member is frustrated that the LlamaIndex tokenizer only encodes and has no decode function. The comments discuss the need to truncate long documents: some community members suggest using a splitter and setting the chunk size, while others argue that the best solution is to remove the truncation feature entirely. There is no explicitly marked answer, but the discussion shows the community exploring different approaches to handling large documents.

ALSO it's driving me CRAZY that the LlamaIndex tokenizer only ENCODES and has no decode, I'm going to scream lol.
Why do you need decode? It's just for token counting πŸ˜…
how do I truncate a document? I can use Settings.encoder to detect that a document is too large, but how do I truncate it?
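This is exactly where a decode method matters: to truncate by token count you encode, slice the token list, and decode back to text. A minimal sketch of that pattern, using a toy whitespace tokenizer as a stand-in so it is self-contained (with a real tokenizer such as tiktoken, `encode`/`decode` play the same roles):

```python
def encode(text):
    """Stand-in for tokenizer.encode: text -> list of tokens."""
    return text.split()

def decode(tokens):
    """Stand-in for tokenizer.decode: tokens -> text."""
    return " ".join(tokens)

def truncate(text, max_tokens, keep="first"):
    """Truncate text to at most max_tokens tokens.

    keep="first" drops the tail; keep="last" drops the head.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    kept = tokens[:max_tokens] if keep == "first" else tokens[-max_tokens:]
    return decode(kept)

doc = "one two three four five six"
print(truncate(doc, 3))               # -> "one two three"
print(truncate(doc, 3, keep="last"))  # -> "four five six"
```

Note that decoding a sliced token list is lossy at the cut point for subword tokenizers, which is one reason an encode-only tokenizer makes this awkward.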
unless I entirely remove the truncation feature from my class πŸ™‚
which I could. Or let people pass their own encoder in, which I think is the best option. And of course you only need to pass an encoder if you're using the truncation feature; otherwise it doesn't matter
What/where are you trying to truncate? Generally it's either handled for you automatically, or you've already chunked your data to a suitable size
the full documents, which can easily exceed the 200k/128k context limits
I give the user the option to warn, throw an error, or ignore... or truncate first or truncate last
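The warn/error/ignore policies described here could be dispatched from a single check. A hypothetical sketch (names and signature are illustrative, not the actual class):

```python
import warnings

def check_length(num_tokens, limit, policy="warn"):
    """Return True if the document fits within the token limit.

    On overflow: policy="error" raises, policy="warn" emits a warning,
    policy="ignore" stays silent; the caller can then truncate first/last.
    """
    if num_tokens <= limit:
        return True
    msg = f"document has {num_tokens} tokens, limit is {limit}"
    if policy == "error":
        raise ValueError(msg)
    if policy == "warn":
        warnings.warn(msg)
    return False

check_length(100, limit=200)                    # fits -> True
check_length(300, limit=200, policy="ignore")   # overflows silently -> False
```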
but maybe you're right, the best code is no code
Yeah, tbh I would use a splitter and set the chunk size as needed
I might just delete all my truncation code πŸ™‚ if your documents are too large it's the user's fault! hah!
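The splitter approach suggested above can be sketched as a sliding window over the token list. In LlamaIndex this role is played by a node parser such as `SentenceSplitter(chunk_size=...)`, which also respects sentence boundaries; the self-contained version below shows only the core chunk-size-with-overlap mechanics:

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size with optional overlap.

    A minimal sketch of what a text splitter does under the hood;
    real splitters work on text and avoid cutting mid-sentence.
    """
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Chunking instead of truncating keeps the whole document retrievable, which is why it sidesteps the "document exceeds the context limit" problem entirely.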