Preparing for the era of 32K context: Ea...

Very thorough analysis, nice!
Notice that the throughput goes from 48 tok/s down to 13.5 tok/s if you fill the entire context window

That's a huuuge hit, and it's why RAG is important ❀️
tok/s is more related to performance?
why is that a good metric?
do you have any content to recommend?
tokens per second is a measure of performance, yes

Many applications depend on fast response times. Further, a lot of newer algorithms involve multiple LLM calls, so keeping those calls fast is important πŸ’ͺ
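
A minimal sketch of how tok/s is usually measured: time one generation call and divide the number of generated tokens by the elapsed time. The `fake_generate` stub below is hypothetical, just so the snippet runs standalone; swap in your actual client.

```python
import time

def tokens_per_second(generate, prompt: str) -> float:
    """Time a single generation call and report decode throughput."""
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stub generator so the sketch runs on its own.
def fake_generate(prompt: str) -> list[str]:
    time.sleep(0.5)          # pretend decode latency
    return ["tok"] * 64      # pretend 64 generated tokens

print(f"{tokens_per_second(fake_generate, 'hello'):.1f} tok/s")
```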
Is there any paper or content explaining why this performance improvement happens when you fill the context window?
Oh sorry, I wasn't clear, the tok/s is going down as you fill the context window, which is not an improvement πŸ˜…
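
The usual explanation for the slowdown: during decoding, each new token attends over the entire KV cache, so per-token cost grows roughly linearly with how full the context is. Here's a toy model of that effect; the coefficients are completely made up, just to illustrate the shape of the curve:

```python
# Toy decode-latency model: per-token time = fixed cost + a term that
# scales with context length (attention over the KV cache).
base_ms, per_ctx_token_ms = 15.0, 0.002

def tok_per_s(context_len: int) -> float:
    return 1000.0 / (base_ms + per_ctx_token_ms * context_len)

for ctx in (512, 8_192, 32_768):
    print(f"{ctx:>6} ctx tokens -> {tok_per_s(ctx):5.1f} tok/s")
```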
I'm a noob ahahaha
so that's why compact and refine are better
if we fill the context window, we will see a decrease in performance, and if this "lost in the middle" effect is right, it will be less accurate as well
Yea that's right!
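
For anyone landing here later: compact and refine are response-synthesis modes in LlamaIndex, which this thread appears to be about. A minimal sketch of selecting them, assuming the `llama_index` package and a placeholder `data/` directory of documents:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# "data/" is a placeholder: point this at your own documents.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# "compact" packs as many retrieved chunks as fit into each LLM call
# (fewer calls, shorter prompts); "refine" makes one call per chunk,
# refining the answer sequentially.
compact_engine = index.as_query_engine(response_mode="compact")
refine_engine = index.as_query_engine(response_mode="refine")

print(compact_engine.query("Summarize the key findings."))
```

Either way, retrieval keeps each individual call well below the full context window, which is exactly the throughput point made above.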