What code did you run to hit this? You'll likely need to add some sleep to process the data a little more slowly.
I'm using the summary index in tree summarize mode with a router query engine
in a condense chat engine
How should I proceed in my case? Is that the best practice here?
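(For reference, a setup along those lines looks roughly like the sketch below -- legacy ServiceContext-style LlamaIndex API; the paths, descriptions, and variable names are just illustrative, not taken from this thread.)

Python
from llama_index import ServiceContext, SimpleDirectoryReader, SummaryIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.query_engine import RouterQueryEngine
from llama_index.tools import QueryEngineTool

documents = SimpleDirectoryReader("./thesis").load_data()
service_context = ServiceContext.from_defaults()

# Summary index, queried in tree_summarize mode
summary_index = SummaryIndex.from_documents(documents, service_context=service_context)
summary_engine = summary_index.as_query_engine(response_mode="tree_summarize")

# Router that can pick the summary engine (other tools omitted here)
router_engine = RouterQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=summary_engine,
            description="Useful for summarizing the whole thesis",
        ),
    ],
    service_context=service_context,
)

# Wrapped in a condense-question chat engine
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=router_engine,
    service_context=service_context,
)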
ah, and it's probably creating the summary index that's causing this error?
not creating it, but querying it
Might have to set use_async=False -- although tbh we should probably be better at handling this.

It seems like a document this big is causing too many API requests at a given time πŸ˜…
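(A minimal sketch of where that flag goes, assuming the summary index setup above -- use_async is forwarded down to the response synthesizer, so the tree_summarize calls run one at a time instead of in parallel:)

Python
# Run the tree_summarize LLM calls sequentially rather than concurrently
summary_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    use_async=False,
)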
Async is already false
According to the trace, the error happens during the use of the synthesizer, after many retries
Yeah, it's linked to using too many tokens too fast
Without async, it's kind of wild that it's using that much in a minute though
it happens with both gpt-3.5-turbo (0611) and gpt-4-turbo-preview
Did you set max_tokens in the LLM definition?
yeah at 2048
this is what happens multiple times after some calls
Kind of weird there are so many retries. The default is 3 retries
Maybe try decreasing max_tokens to avoid the rate limit

Or, you can set max_retries in the LLM to be higher (default is 3)
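(Both knobs live on the LLM definition; a sketch, with values chosen only as an example:)

Python
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo-1106",
    max_tokens=2048,   # cap on tokens generated per response
    max_retries=10,    # retry rate-limit errors more than the default of 3
)
service_context = ServiceContext.from_defaults(llm=llm)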
no, it's just a logging bug
there's only one retry in reality
max_tokens is only set to 2048
Right, but it's requesting 2048 tokens so many times in a minute that it's hitting the rate limit
yeah ok I see
(i think, anyways, that's my hypothesis haha)
Yeah, will try these solutions, thanks for always being here!
Actually, a better solution might be keeping max_tokens at 2048 (I'm assuming you were trying to avoid cut-off responses) and instead artificially lowering the context window

Python
service_context = ServiceContext.from_defaults(...., context_window=10000)


gpt-3.5-turbo-1106 has a 16K context window. tree_summarize is nearly stuffing the context window in every LLM call (the error above said it requested like 12k tokens)
So if we artificially lower the context window, each request will consume fewer tokens and hopefully stay under the token limit
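(Concretely, building on the llm defined earlier -- the 10,000 figure is just the example value from above:)

Python
# The model really has a 16K window; advertising a smaller one means
# tree_summarize packs fewer tokens into each individual request
service_context = ServiceContext.from_defaults(llm=llm, context_window=10000)
summary_index = SummaryIndex.from_documents(documents, service_context=service_context)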
max_retries works
it seems waiting more helps
I am sorry, but I cannot provide a full summary of the thesis as this would involve dealing with a large volume of information and specific details. However, I can help you summarise specific parts of the thesis or answer questions on particular topics covered in it. Feel free to ask more targeted questions or request summaries of specific sections.
but I have this now lol
What was the query? Maybe we can modify it slightly so it doesn't treat the task as summarizing an entire thesis? query_engine.query("Highlight the important details from the provided text")
The query was "generate a complete summary of the thesis", so yeah, maybe it is too much
will try this to see
seems to work better but the summary is very short
hmm yea, might need some prompt tweaks
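(One hedged way to do that tweak: override the tree_summarize summary prompt. The wording below is purely illustrative, not a recommended template:)

Python
from llama_index.prompts import PromptTemplate

# Hypothetical prompt asking for a longer, structured summary
summary_tmpl = PromptTemplate(
    "Context information from multiple sources is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Write a detailed, multi-paragraph answer to the query, "
    "covering every major section you see.\n"
    "Query: {query_str}\n"
    "Answer: "
)

query_engine = summary_index.as_query_engine(
    response_mode="tree_summarize",
    summary_template=summary_tmpl,
)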
And lol, with gpt-4 preview I get the error even with the first fix
trying to set max_retries to 100 lol
(it was set at 10)
the new gpt-4 has a 128K context window, right? Then you probably really need to artificially shrink it
so like you explained here?
it could hit your 60,000 tokens-per-minute limit in one LLM call πŸ˜…
yea like that
I checked for my poor tier 1 plan
So yes, 150,000 TPM
And just saw gpt-3.5-turbo-instruct is 250,000 TPM
and just saw there's also a 500k TPD limit, so the limit is always reached
(in this very very specific case)
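(Rough intuition for why tree_summarize trips this, using only the numbers mentioned in the thread -- the per-call figure is approximate:)

Python
# ~12,000 tokens per tree_summarize call (from the rate-limit error above)
tokens_per_call = 12_000
tpm_limit = 150_000  # tier 1 limit mentioned above

calls_per_minute_allowed = tpm_limit // tokens_per_call  # -> 12
# summarizing a large thesis takes far more than 12 LLM calls, so without
# backoff/retries (or a smaller context window) the per-minute limit is hit fast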