
Token

At a glance

The community member is frustrated that the LlamaIndex tokenizer only encodes and has no decode function. The comments discuss the need to truncate long documents: some community members suggest using a splitter and setting the chunk size, while others argue that the best solution is to remove the truncation feature entirely. There is no explicitly marked answer, but the discussion shows the community exploring different approaches to handling large documents.

ALSO it's driving me CRAZY that the LlamaIndex tokenizer only ENCODES and has no decode, I'm going to scream lol.
Why do you need decode? It's just for token counting πŸ˜…
how do I truncate a document? I can use Settings.encoder to detect that a document is too large, but how do I truncate it?
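This is exactly where a decode method matters: to truncate by token count you encode, slice the token list, and decode back to text. A minimal sketch of that pattern, using a toy whitespace tokenizer as a stand-in so it is self-contained (with a real tokenizer such as tiktoken, `encode`/`decode` play the same roles):

```python
def encode(text):
    """Stand-in for tokenizer.encode: text -> list of tokens."""
    return text.split()

def decode(tokens):
    """Stand-in for tokenizer.decode: tokens -> text."""
    return " ".join(tokens)

def truncate(text, max_tokens, keep="first"):
    """Truncate text to at most max_tokens tokens.

    keep="first" drops the tail; keep="last" drops the head.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    kept = tokens[:max_tokens] if keep == "first" else tokens[-max_tokens:]
    return decode(kept)

doc = "one two three four five six"
print(truncate(doc, 3))               # -> "one two three"
print(truncate(doc, 3, keep="last"))  # -> "four five six"
```

Note that decoding a sliced token list is lossy at the cut point for subword tokenizers, which is one reason an encode-only tokenizer makes this awkward.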
unless I entirely remove the truncation feature from my class πŸ™‚
which I could. Or let people pass their own encoder in, which I think is the best option. And of course you only need to pass an encoder if you're using the truncation feature; otherwise it doesn't matter
What/where are you trying to truncate? Generally it's either handled for you automatically, or you've already chunked your data to a suitable size
the full documents, which can easily exceed the 200k/128k context limits
I give the user the option to warn, throw an error, or ignore... or truncate first or truncate last
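The warn/error/ignore policies described here could be dispatched from a single check. A hypothetical sketch (names and signature are illustrative, not the actual class):

```python
import warnings

def check_length(num_tokens, limit, policy="warn"):
    """Return True if the document fits within the token limit.

    On overflow: policy="error" raises, policy="warn" emits a warning,
    policy="ignore" stays silent; the caller can then truncate first/last.
    """
    if num_tokens <= limit:
        return True
    msg = f"document has {num_tokens} tokens, limit is {limit}"
    if policy == "error":
        raise ValueError(msg)
    if policy == "warn":
        warnings.warn(msg)
    return False

check_length(100, limit=200)                    # fits -> True
check_length(300, limit=200, policy="ignore")   # overflows silently -> False
```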
but maybe you're right, the best code is no code
Yeah, tbh I would use a splitter and set the chunk size as needed
I might just delete all my truncation code πŸ™‚ if your documents are too large it's the user's fault! hah!
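The splitter approach suggested above can be sketched as a sliding window over the token list. In LlamaIndex this role is played by a node parser such as `SentenceSplitter(chunk_size=...)`, which also respects sentence boundaries; the self-contained version below shows only the core chunk-size-with-overlap mechanics:

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    """Split a token list into windows of chunk_size with optional overlap.

    A minimal sketch of what a text splitter does under the hood;
    real splitters work on text and avoid cutting mid-sentence.
    """
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
# -> [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Chunking instead of truncating keeps the whole document retrievable, which is why it sidesteps the "document exceeds the context limit" problem entirely.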