Preparing for the era of 32K context: Ea...
Emanuel Ferreira
last year
https://together.ai/blog/llama-2-7b-32k?utm_source=bensbites&utm_medium=newsletter&utm_campaign=llms-are-making-robots-smarter
Logan M
last year
Very thorough analysis, nice!
Logan M
last year
Notice that the throughput goes from 48 tok/s down to 13.5 tok/s if you fill the entire context window
That's a huuuge hit, and why RAG is important ❤️
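To put those two numbers in perspective, here is a rough sketch of the arithmetic. The 48 and 13.5 tok/s figures are the ones quoted above; the 270-token answer length is a made-up example, and the function assumes decode speed stays constant for the whole answer, which is a simplification.

```python
# Throughput figures quoted in the thread (tokens per second).
SHORT_CONTEXT_TOK_PER_S = 48.0
FULL_CONTEXT_TOK_PER_S = 13.5


def generation_seconds(num_tokens: int, tok_per_s: float) -> float:
    """Seconds needed to generate num_tokens at a constant throughput."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return num_tokens / tok_per_s


# How many times slower generation gets with a full context window.
slowdown = SHORT_CONTEXT_TOK_PER_S / FULL_CONTEXT_TOK_PER_S

# Hypothetical answer length, chosen only for illustration.
answer_tokens = 270
short_ctx_time = generation_seconds(answer_tokens, SHORT_CONTEXT_TOK_PER_S)
full_ctx_time = generation_seconds(answer_tokens, FULL_CONTEXT_TOK_PER_S)

print(f"slowdown: {slowdown:.2f}x")
print(f"{answer_tokens} tokens: {short_ctx_time:.1f}s vs {full_ctx_time:.1f}s")
```

So the same answer takes roughly three and a half times longer when the window is full, which is the case for retrieving only the relevant chunks instead of stuffing everything in.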
Emanuel Ferreira
last year
That's awesome
Emanuel Ferreira
last year
Is tok/s more related to performance?
Emanuel Ferreira
last year
Why is that a good metric?
Emanuel Ferreira
last year
Do you have any content to recommend?
Logan M
last year
Tokens per second is a performance metric, yes
Many applications depend on fast response times. Further, a lot of newer algorithms involve multiple LLM calls, so keeping those calls fast is important 💪
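As a rough illustration of why multiple calls make throughput matter more, here is a sketch that adds up generation time for a chain of sequential LLM calls. The number of calls and tokens per call are hypothetical, the calls are assumed to run one after another, and prompt processing time is ignored.

```python
def pipeline_seconds(num_calls: int, tokens_per_call: int, tok_per_s: float) -> float:
    """Total generation time for sequential LLM calls at a constant throughput."""
    if num_calls < 0 or tokens_per_call < 0:
        raise ValueError("counts must be non-negative")
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return num_calls * tokens_per_call / tok_per_s


# Hypothetical pipeline: 4 sequential calls, 135 generated tokens each.
fast_total = pipeline_seconds(4, 135, 48.0)   # throughput with a short context
slow_total = pipeline_seconds(4, 135, 13.5)   # throughput with a full context

print(f"short context: {fast_total:.2f}s, full context: {slow_total:.2f}s")
```

The per-call slowdown is multiplied by the number of calls, so a delay that is tolerable for one request can become noticeable across a whole pipeline.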
Emanuel Ferreira
last year
Definitely!
Emanuel Ferreira
last year
Is there any paper or content explaining why performance improves when filling the context window?
Logan M
last year
Oh sorry, I wasn't clear: the tok/s goes down as you fill the context window, which is not an improvement
Emanuel Ferreira
last year
ooooh gotcha now
Emanuel Ferreira
last year
I'm a noob ahahaha
Emanuel Ferreira
last year
So that's why compact and refine is better
Emanuel Ferreira
last year
And it's related to this as well, I think:
https://arxiv.org/pdf/2307.03172.pdf
Emanuel Ferreira
last year
If we fill the context window, throughput will drop, and if this "Lost in the Middle" paper is right, the answers will be less accurate as well
Emanuel Ferreira
last year
Right? 🤔
Logan M
last year
Yea that's right!