Hi @jerryjliu0, we've had a few instances of the error "The model's maximum context length is 4097 tokens (3842 in your prompt, 256 for the completion), please reduce your prompt or completion length."
This is with a simple index JSON and a straightforward index.query. It doesn't appear to be possible for us to control it (we feed in the document and a one-line question). Is something going wrong with the math that calculates the token budget? We're looking into the code ourselves to see if we can identify the issue, but we'd be grateful for any pointers or ideas about what might be going wrong. I assume something in the question + refinement cycle isn't counting tokens properly.
We figured out a sort of hack. Basically, when indexing, if we use a different predictor setting with a slightly bigger max_token_size (say 256 + 10 = 266), but then use 256 as the max_token_size when querying, the error seems to happen less. We think the tiktoken token counting isn't 100% accurate, so either the library needs to bear this in mind and introduce a buffer, or handle it some other way.
The underlying assumption is that at index time that size plays a role in the chunk size, which in turn shapes the results. The overflow is usually only one or two tokens, and anecdotally it has happened with non-English documents, which may indicate there is some variation in token length / tiktoken accuracy in places.
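For reference, this is roughly how we sanity-check the counts ourselves: just counting a chunk with tiktoken and comparing against what the API error reports for the prompt. The encoding choice here is an assumption (davinci-era models map to p50k_base); swap in whichever model you're actually calling.

```python
import tiktoken

# Assumed model; adjust to whatever your predictor is configured with.
enc = tiktoken.encoding_for_model("text-davinci-003")

def count_tokens(text: str) -> int:
    """Count tokens the same way we'd expect the prompt budget to be computed."""
    return len(enc.encode(text))

# Example: count a non-English chunk and compare with the prompt size in the error.
chunk = "Dies ist ein Beispielabschnitt aus einem nicht-englischen Dokument."
print(count_tokens(chunk))
```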
We just put a wrapper around creating a predictor, with a "padding" parameter that adds the given number of tokens to the max_token_size. For indexing, we call it with padding=10 for the predictor that indexing needs; for querying, we call it with padding=0 for the actual query prediction. We haven't seen the issue repeat since.
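A minimal sketch of that wrapper, assuming the LLMPredictor + langchain OpenAI style of setup (the make_predictor name and the padding parameter are our own, not part of the library, and exact parameter names may differ by version):

```python
from langchain.llms import OpenAI
from llama_index import LLMPredictor

MAX_TOKENS = 256  # the completion budget we actually want at query time

def make_predictor(padding: int = 0) -> LLMPredictor:
    """Build a predictor whose max output tokens is MAX_TOKENS plus some padding.

    Indexing gets a slightly larger budget than querying, so that small
    token-counting discrepancies don't push the prompt over the context limit.
    """
    return LLMPredictor(
        llm=OpenAI(
            temperature=0,
            model_name="text-davinci-003",
            max_tokens=MAX_TOKENS + padding,
        )
    )

index_predictor = make_predictor(padding=10)  # used when building the index
query_predictor = make_predictor(padding=0)   # used for the actual query
```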