Is this something to worry about "Token indices sequence length is longer than the specified maximum sequence length for this model"
Which LLM / embedding model are you using? I thought I had fixed some of this.
I was getting a similar error when my documents were too long for the TokenSplitter to handle, though I'm not sure if the error text was identical off the top of my head.
I was using davinci and ada (the defaults)
If you send the stack trace or DM me a sample dataset + code snippet, I'm happy to help! This is one of the areas of the code that I'd love to make bug-free.
Sure, I will do that. The only thing is I have to find it, as it's lost in my chain of terminal windows haha.
One issue I had that was related to this is the following:
"A single term is larger than the allowed chunk size.\n"
f"Term size: {num_cur_tokens}\n"
f"Chunk size: {self._chunk_size}"
I get this when using chunk_size_limit=256.
Is there a way to figure out the lowest possible chunk size for a document, to avoid running into this issue?
Also, which separator is chosen for chunking?
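One rough way to estimate a floor for the chunk size (a sketch, not the library's API): if whitespace is taken as the separator, the token count of the largest single term is the smallest chunk size that can still hold it. tiktoken's gpt2 encoding is assumed here as an approximation of the model's tokenizer, and min_viable_chunk_size is a hypothetical helper:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def min_viable_chunk_size(text: str) -> int:
    """Token count of the largest whitespace-separated term (hypothetical helper).

    Any chunk size below this can trip the "A single term is larger than the
    allowed chunk size" check, since that term can never fit in one chunk.
    """
    return max(len(enc.encode(term)) for term in text.split())

# Usage: pick a chunk_size_limit comfortably above this value.
# print(min_viable_chunk_size(open("datasheet.txt").read()))
```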
Ah, I have experienced this too. In documents with a lot of special characters, those characters are often treated by the model as individual tokens. I have noticed that if I don't clean the data first, dropping the chunk size too low results in this error, due to the sheer number of special characters in there (I suspect).

This is more of a theory to be honest; I got around it by cleaning up the input so that formatting and extra characters are removed, assuming they don't affect the meaning of the text (see the sketch after this comment).

However, increasing the chunk size worked for me as a kludge until I got around to doing that.

Considering this is a datasheet, it's very similar to the documents I was building with and might have a similar issue. PDFs are messy.
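A rough sketch of the kind of pre-cleaning described above; the regex and the character whitelist are illustrative assumptions, not part of any library:

```python
import re

def clean_text(raw: str) -> str:
    """Strip formatting noise before chunking/indexing (illustrative only)."""
    # Collapse runs of whitespace that PDF extraction tends to leave behind.
    text = re.sub(r"\s+", " ", raw)
    # Drop characters outside a basic printable set; widen the whitelist if
    # your documents depend on other symbols.
    text = re.sub(r"[^A-Za-z0-9 .,;:!?()%/'-]", "", text)
    return text.strip()
```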
Ah, I thought I had fixed the "a single term is larger than the allowed chunk size" issue, but I may need to take a closer look.
Looking forward to the release.
RE: this, it turns out my hypothesis was mostly correct, based on my current understanding.

According to this (https://beta.openai.com/tokenizer), the general rule of thumb for characters -> tokens is:

A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).

My experience with the TokenSplitter is that len(vector) > chunk_size pretty much always, and that has thrown me off before. Not sure if it's related to the aforementioned error, however.
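A quick way to sanity-check that rule of thumb against actual token counts; tiktoken's gpt2 encoding is assumed here as a stand-in for whatever tokenizer the splitter uses:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "A helpful rule of thumb is that one token corresponds to roughly four characters."
estimated = len(text) / 4          # the ~4-characters-per-token heuristic
actual = len(enc.encode(text))     # what the tokenizer actually produces
print(f"estimated ~{estimated:.0f} tokens, actual {actual} tokens")
```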