No worries! Which part is confusing to you?
what does "input" refer to in "max_input_size"?
the input to the model -> GPT is a decoder-based model, and it has a limited context window.
Basically, this means there's an absolute cap on how much can be fed into the model, max_input_size (which for GPT-3 and GPT-3.5 is 4096 tokens)
It being a decoder model is important, because it predicts one token at a time until a special stop token is predicted.
After each token is predicted, it is added to the input and the next token is predicted
So technically, it will keep predicting until the special stop token, or until the input becomes greater than max_input_size
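Just to make that loop concrete, here's a rough Python sketch of what a decoder-only model does at generation time (the `predict_next_token` function and the stop-token value are stand-ins for illustration, not real API calls):

```python
MAX_INPUT_SIZE = 4096
STOP_TOKEN = "<|endoftext|>"

def generate(prompt_tokens, predict_next_token):
    tokens = list(prompt_tokens)
    while True:
        next_token = predict_next_token(tokens)  # model looks at everything so far
        if next_token == STOP_TOKEN:
            break
        tokens.append(next_token)                # generated token is fed back in as input
        if len(tokens) >= MAX_INPUT_SIZE:        # context window is full, have to stop
            break
    return tokens
```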
I highly recommend this blog post for better details, a lot of this is specific to the architecture of GPT:
https://jalammar.github.io/illustrated-gpt2/
It talks about GPT-2, but GPT-3 is basically the same, just bigger
Other models are a little different, because they use encoder/decoder architectures instead (like Google's FLAN-T5)
It's my understanding that the GPT models have only a single token limit, and it applies to the total tokens in both the prompt and the completion
And for 3.5, that limit is 4096.
It doesn't make sense to describe this token limit as the "max input"
Unless I'm misunderstanding something
It's described as a "max input size" because 4096 is the max - if you try to input 4097 tokens into the model, you'll get an error
So back to the calculation
max_input_size=4096
prompt_tokens=200 (a guess)
num_output=256 (the maximum number of expected output tokens)
chunk_size=4096-200-256
Now, the model might not use all 256 output tokens that we left room for, but we can't know this ahead of time, so we've left space for them. Remember, each token generated is then added back into the input. So that's why we need to leave that "space"
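If it helps, here's that arithmetic written out as code (prompt_tokens is just a guess, like above):

```python
max_input_size = 4096  # hard cap for GPT-3 / GPT-3.5
prompt_tokens = 200    # estimated size of the prompt template (a guess)
num_output = 256       # space reserved for the generated tokens

# whatever is left over is the room for the actual chunk text
chunk_size = max_input_size - prompt_tokens - num_output
print(chunk_size)  # 3640
```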
Maybe this is going in circles though lol, that's about the best I can explain it
the word "input" was really throwing me off
No worries! I hope it's a little more clear!
Most chat models use this same architecture, so it should be the same idea for most models. The only one that's slightly different (that I can think of) is Google's FLAN, but maybe don't worry about that unless you use that model lol
as a follow-up - shouldn't we be subtracting padding * num_chunks on line 112?
since the padding is the space between chunks, num_chunks * padding gives us the aggregated padding amount
Since the function is calculating the size of a single chunk, no need to worry about how many other chunks there are
If LlamaIndex ends up creating 10 chunks, and the padding is one, that will be accumulated across all chunks like you said
hmm ok, so it takes "num_chunks" as a parameter just to fuck with me, ey
because it's always going to be 1, is what you're saying I think
Lol yea! Looking at it closer, we also divide by num_chunks before we subtract the padding
After that division, it basically turns into "size per 1 chunk", and so we subtract the padding given a single chunk
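So the logic is roughly this (the names here are assumed for illustration, not copied from the LlamaIndex source):

```python
def get_chunk_size(max_input_size, num_prompt_tokens, num_output,
                   num_chunks, padding=1):
    # total space left after the prompt and the reserved output tokens
    available = max_input_size - num_prompt_tokens - num_output
    # divide across chunks first, *then* subtract padding,
    # so the padding is applied once per chunk rather than once total
    return available // num_chunks - padding

# e.g. 10 chunks: each chunk gives up 1 token to padding, so 10 tokens total
print(get_chunk_size(4096, 200, 256, num_chunks=10, padding=1))  # 363
```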