Preparing for the era of 32K context: Ea...

Very thorough analysis, nice!
Notice that the throughput goes from 48 tok/s down to 13.5 tok/s if you fill the entire context window

That's a huuuge hit, and it's why RAG is important ❀️
tok/s is more related to performance?
why is that a good metric?
do you have any content to recommend?
tokens per second is a measure of performance, yes

Many applications depend on fast response times. Further, a lot of newer algorithms involve multiple LLM calls, so keeping those calls fast is important πŸ’ͺ
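
A minimal sketch of how tok/s is usually measured: time one generation call and divide the number of generated tokens by the elapsed time. The `fake_generate` stub below is hypothetical, just so the snippet runs standalone; swap in your actual client.

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time a single generation call and report decode throughput."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stub generator so the sketch runs on its own.
def fake_generate(prompt: str) -> list[str]:
    time.sleep(0.5)          # pretend decode latency
    return ["tok"] * 64      # pretend 64 generated tokens

print(f"{tokens_per_second(fake_generate, 'hello'):.1f} tok/s")
```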
Is there any paper or content explaining why this performance improvement happens when you fill the context window?
Oh sorry, I wasn't clear, the tok/s is going down as you fill the context window, which is not an improvement πŸ˜…
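
The usual explanation for the slowdown: during decoding, each new token attends over the entire KV cache, so per-token cost grows roughly linearly with how full the context is. Here's a toy model of that effect; the coefficients are completely made up, just to illustrate the shape of the curve:

```python
# Toy decode-latency model: per-token time = fixed cost + a term that
# scales with context length (attention over the KV cache).
base_ms, per_ctx_token_ms = 15.0, 0.002

def tok_per_s(context_len: int) -> float:
    return 1000.0 / (base_ms + per_ctx_token_ms * context_len)

for ctx in (512, 8_192, 32_768):
    print(f"{ctx:>6} ctx tokens -> {tok_per_s(ctx):5.1f} tok/s")
```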
I'm a noob ahahaha
so that's why compact and refine are better
if we fill the context window, we will see a decrease in performance, and if this "lost in the middle" effect is right, it will be less accurate as well
Yea that's right!
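
For anyone landing here later: compact and refine are response-synthesis modes in LlamaIndex, which this thread appears to be about. A minimal sketch of selecting them, assuming the `llama_index` package and a placeholder `data/` directory of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# "data/" is a placeholder: point this at your own documents.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# "compact" packs as many retrieved chunks as fit into each LLM call
# (fewer calls, shorter prompts); "refine" makes one call per chunk,
# refining the answer sequentially.
compact_engine = index.as_query_engine(response_mode="compact")
refine_engine = index.as_query_engine(response_mode="refine")

print(compact_engine.query("Summarize the key findings."))
```

Either way, retrieval keeps each individual call well below the full context window, which is exactly the throughput point made above.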