What code did you run to hit this? You'll likely need to add some sleep to process the data a little more slowly.
I'm using the summary index in tree summarize mode with a router query engine
in a condense chat engine
How should I proceed in my case? Is that the best practice here?
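(For reference, a setup along those lines looks roughly like the sketch below -- legacy ServiceContext-style LlamaIndex API; the paths, descriptions, and variable names are just illustrative, not taken from this thread.)

Python
from llama_index import ServiceContext, SimpleDirectoryReader, SummaryIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.query_engine import RouterQueryEngine
from llama_index.tools import QueryEngineTool

documents = SimpleDirectoryReader("./thesis").load_data()
service_context = ServiceContext.from_defaults()

# Summary index, queried in tree_summarize mode
summary_index = SummaryIndex.from_documents(documents, service_context=service_context)
summary_engine = summary_index.as_query_engine(response_mode="tree_summarize")

# Router that can pick the summary engine (other tools omitted here)
router_engine = RouterQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=summary_engine,
            description="Useful for summarizing the whole thesis",
        ),
    ],
    service_context=service_context,
)

# Wrapped in a condense-question chat engine
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=router_engine,
    service_context=service_context,
)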
ah, and it's probably creating the summary index that's causing this error?
not creating it, but querying it
Might have to set use_async=False -- although tbh we should probably be better at handling this.

It seems like a document this big is causing too many API requests at a given time πŸ˜…
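(A minimal sketch of where that flag goes, assuming the summary index setup above -- use_async is forwarded down to the response synthesizer, so the tree_summarize calls run one at a time instead of in parallel:)

Python
# Run the tree_summarize LLM calls sequentially rather than concurrently
summary_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=False,
)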
Async is already false
According to the trace, the error happens during the use of the synthesizer, after many retries
Yeah, it's linked to using too many tokens too fast
Without async, it's kind of wild that it's using that much in a minute though
it happens with both gpt-3.5-turbo (0611) and gpt-4-turbo-preview
Did you set max_tokens in the LLM definition?
yeah at 2048
this is what happens multiple times after some calls
Kind of weird there are so many retries. The default is 3 retries
Maybe try decreasing max_tokens to avoid the rate limit

Or, you can set max_retries in the LLM to be higher (default is 3)
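(Both knobs live on the LLM definition; a sketch, with values chosen only as an example:)

Python
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo-1106",
    max_tokens=2048,   # cap on tokens generated per response
    max_retries=10,    # retry rate-limit errors more than the default of 3
)
service_context = ServiceContext.from_defaults(llm=llm)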
no, it's just a logging bug
there's only one retry in reality
max_tokens is only set to 2048
Right, but it's requesting 2048 tokens so many times in a minute that it's hitting the rate limit
yeah ok I see
(i think, anyways, that's my hypothesis haha)
Yeah, will try these solutions, thanks for always being here!
Actually, a better solution might be keeping max_tokens at 2048 (I'm assuming you were trying to avoid cut-off responses) and instead artificially lowering the context window

Python
service_context = ServiceContext.from_defaults(...., context_window=10000)


gpt-3.5-turbo-1106 has a 16K context window. tree_summarize is nearly stuffing the context window in every LLM call (the error above said it requested like 12k tokens)
So if we artificially lower the context window, each request will consume fewer tokens and hopefully stay under the token limit
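(Concretely, building on the llm defined earlier -- the 10,000 figure is just the example value from above:)

Python
# The model really has a 16K window; advertising a smaller one means
# tree_summarize packs fewer tokens into each individual request
service_context = ServiceContext.from_defaults(llm=llm, context_window=10000)
summary_index = SummaryIndex.from_documents(documents, service_context=service_context)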
max_retries works
it seems waiting more helps
I am sorry, but I cannot provide a full summary of the thesis as this would involve dealing with a large volume of information and specific details. However, I can help you summarise specific parts of the thesis or answer questions on particular topics covered in it. Feel free to ask more targeted questions or request summaries of specific sections.
but I have this now lol
What was the query? Maybe we can modify it slightly so it doesn't treat the task as summarizing an entire thesis? query_engine.query("Highlight the important details from the provided text")
The query was "generate a complete summary of the thesis", so yeah, maybe it is too much
will try this to see
seems to work better but the summary is very short
hmm yea, might need some prompt tweaks
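(One hedged way to do that tweak: override the tree_summarize summary prompt. The wording below is purely illustrative, not a recommended template:)

Python
from llama_index.prompts import PromptTemplate

# Hypothetical prompt asking for a longer, structured summary
summary_tmpl = PromptTemplate(
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Write a detailed, multi-paragraph answer to the query, "
    "covering every major section you see.\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    summary_template=summary_tmpl,
)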
And lol, with gpt-4 preview I get the error even with the first fix
trying to set max_retries to 100 lol
(it was set at 10)
the new gpt-4 has a 128K context window, right? Then you probably really need to artificially shrink it
so like you explained here?
it could hit your 60,000 tokens-per-minute limit in one LLM call πŸ˜…
yea like that
I checked for my poor tier 1 plan
So yes, 150,000 TPM
And just saw gpt-3.5-turbo-instruct is 250,000 TPM
and just saw there's also a 500k TPD limit, so the limit is always reached
(in this very very specific case)
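(Rough intuition for why tree_summarize trips this, using only the numbers mentioned in the thread -- the per-call figure is approximate:)

Python
# ~12,000 tokens per tree_summarize call (from the rate-limit error above)
tokens_per_call = 12_000
tpm_limit = 150_000  # tier 1 limit mentioned above

calls_per_minute_allowed = tpm_limit // tokens_per_call  # -> 12
# summarizing a large thesis takes far more than 12 LLM calls, so without
# backoff/retries (or a smaller context window) the per-minute limit is hit fast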