Hi @jerryjliu0, we've had a few instances of the error "The model's maximum context length is 4097 tokens (3842 in your prompt, 256 for the completion), please reduce your prompt or completion length."
This is with a simple index JSON and a straightforward index.query. It doesn't appear to be possible for us to control it (we feed in the document and a one-line question). Is something going wrong with the math that calculates the token budget? We're looking into the code ourselves to see if we can identify the issue, but we'd be grateful for any pointers or ideas about what might be going wrong. I assume something in the question + refinement cycle isn't counting tokens properly.
We figured out a sort of hack. Basically, when indexing, if we use a different predictor setting with a slightly bigger max_token_size (say 256 + 10 = 266), but then use 256 as the max_token_size when querying, the error seems to happen less. We think the tiktoken token counting isn't 100% accurate, so either the library needs to bear this in mind and introduce a buffer, or handle it some other way.
The underlying assumption is that at index time that size plays a role in the chunk size, which in turn shapes the results. The overflow is usually only one or two tokens, and anecdotally it has happened with non-English documents, which may indicate there is some variation in token length / tiktoken accuracy in places.
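For reference, this is roughly how we sanity-check the counts ourselves: just counting a chunk with tiktoken and comparing against what the API error reports for the prompt. The encoding choice here is an assumption (davinci-era models map to p50k_base); swap in whichever model you're actually calling.

```python
import tiktoken

# Assumed model; adjust to whatever your predictor is configured with.
enc = tiktoken.encoding_for_model("text-davinci-003")

def count_tokens(text: str) -> int:
    """Count tokens the same way we'd expect the prompt budget to be computed."""
    return len(enc.encode(text))

# Example: count a non-English chunk and compare with the prompt size in the error.
chunk = "Dies ist ein Beispielabschnitt aus einem nicht-englischen Dokument."
print(count_tokens(chunk))
```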
We just put a wrapper around creating a predictor, with a "padding" parameter that adds the given number of tokens to the max_token_size. For indexing, we call it with padding=10 for the predictor that indexing needs; for querying, we call it with padding=0 for the actual query prediction. We haven't seen the issue repeat since.
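A minimal sketch of that wrapper, assuming the LLMPredictor + langchain OpenAI style of setup (the make_predictor name and the padding parameter are our own, not part of the library, and exact parameter names may differ by version):

```python
from langchain.llms import OpenAI
from llama_index import LLMPredictor

MAX_TOKENS = 256  # the completion budget we actually want at query time

def make_predictor(padding: int = 0) -> LLMPredictor:
    """Build a predictor whose max output tokens is MAX_TOKENS plus some padding.

    Indexing gets a slightly larger budget than querying, so that small
    token-counting discrepancies don't push the prompt over the context limit.
    """
    return LLMPredictor(
        llm=OpenAI(
            temperature=0,
            model_name="text-davinci-003",
            max_tokens=MAX_TOKENS + padding,
        )
    )

index_predictor = make_predictor(padding=10)  # used when building the index
query_predictor = make_predictor(padding=0)   # used for the actual query
```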