Is this something to worry about "Token indices sequence length is longer than the specified maximum sequence length for this model"
Which LLM / embedding model are you using? I thought I had fixed some of this.
I was getting a similar error when my documents were too long for the TokenSplitter to handle, though I'm not sure if the error text was identical off the top of my head.
I was using davinci and ada (the defaults)
If you send the stack trace or DM me a sample dataset + code snippet, I'm happy to help! This is one of the areas of the code that I'd love to make bug-free.
Sure, I will do that. The only thing is I have to find it, as it's lost in my chain of terminal windows haha.
One issue I had that was related to this is the following:
"A single term is larger than the allowed chunk size.\n"
f"Term size: {num_cur_tokens}\n"
f"Chunk size: {self._chunk_size}"
I get this when using chunk_size_limit=256.
Is there a way to figure out the lowest possible chunk size for a document, to avoid running into this issue?
Also, which separator is chosen for chunking?
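One rough way to estimate a floor for the chunk size (a sketch, not the library's API): if whitespace is taken as the separator, the token count of the largest single term is the smallest chunk size that can still hold it. tiktoken's gpt2 encoding is assumed here as an approximation of the model's tokenizer, and min_viable_chunk_size is a hypothetical helper:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def min_viable_chunk_size(text: str) -> int:
    """Token count of the largest whitespace-separated term (hypothetical helper).

    Any chunk size below this can trip the "A single term is larger than the
    allowed chunk size" check, since that term can never fit in one chunk.
    """
    return max(len(enc.encode(term)) for term in text.split())

# Usage: pick a chunk_size_limit comfortably above this value.
# print(min_viable_chunk_size(open("datasheet.txt").read()))
```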
Ah, I have experienced this too. In documents with a lot of special characters, those characters are often treated by the model as individual tokens. I have noticed that if I don't clean the data first, dropping the chunk size too low results in this error, due to the sheer number of special characters in there (I suspect).

This is more of a theory to be honest; I got around it by cleaning up the input so that formatting and extra characters are removed, assuming they don't affect the meaning of the text (see the sketch after this comment).

However, increasing the chunk size worked for me as a kludge until I got around to doing that.

Considering this is a datasheet, it's very similar to the documents I was building with and might have a similar issue. PDFs are messy.
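A rough sketch of the kind of pre-cleaning described above; the regex and the character whitelist are illustrative assumptions, not part of any library:

```python
import re

def clean_text(raw: str) -> str:
    """Strip formatting noise before chunking/indexing (illustrative only)."""
    # Collapse runs of whitespace that PDF extraction tends to leave behind.
    text = re.sub(r"\s+", " ", raw)
    # Drop characters outside a basic printable set; widen the whitelist if
    # your documents depend on other symbols.
    text = re.sub(r"[^A-Za-z0-9 .,;:!?()%/'-]", "", text)
    return text.strip()
```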
Ah, I thought I had fixed the "a single term is larger than the allowed chunk size" issue, but I may need to take a closer look.
Looking forward to the release.
RE: this, it turns out my hypothesis was mostly correct, based on my current understanding.

According to this (https://beta.openai.com/tokenizer), the general rule of thumb for characters -> tokens is:

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

My experience with the TokenSplitter is that len(vector) > chunk_size pretty much always, and that has thrown me off before. Not sure if it's related to the aforementioned error, however.
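A quick way to sanity-check that rule of thumb against actual token counts; tiktoken's gpt2 encoding is assumed here as a stand-in for whatever tokenizer the splitter uses:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "A helpful rule of thumb is that one token corresponds to roughly four characters."
estimated = len(text) / 4          # the ~4-characters-per-token heuristic
actual = len(enc.encode(text))     # what the tokenizer actually produces
print(f"estimated ~{estimated:.0f} tokens, actual {actual} tokens")
```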