
At a glance

The community member is experiencing an issue with the LlamaIndex library, where a single question is generating multiple requests, exceeding the OpenAI rate limit of 3 requests per minute. The comments suggest that this is likely due to the default "agent" chat mode, which involves multiple steps, including deciding which tool to use, querying the index, and interpreting the result.

The community members discuss potential solutions, such as trying different chat modes, adding funds to the OpenAI account, and using a token counting handler to monitor the token usage. They also provide code snippets and explanations about the underlying process, including the retrieval of multiple nodes and the interpretation of the response.

There is no explicitly marked answer in the provided information.

Useful resources
Why does LlamaIndex make so many requests for 1 question? This means I can't get an answer, because OpenAI's limit only allows 3 requests per minute
Attachment: image.png
I'm guessing you just did index.as_chat_engine() ?

By default, this is an agent. Which means 1 LLM call to decide which tool to use (i.e. the index), at least one call to query the index, and one call to interpret the result.

Maybe try another chat mode, or add some money to your openai account and set a low $ limit, to get around the rate limits
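Some quick arithmetic on why the default agent mode collides with the free-tier limit from the question (the constants below just restate numbers from this thread):

```python
# Why a 3-requests-per-minute limit is exhausted by a single question.
RATE_LIMIT_RPM = 3       # OpenAI free-tier limit mentioned in the question
CALLS_PER_QUESTION = 3   # minimum LLM calls per user message in agent mode

questions_per_minute = RATE_LIMIT_RPM // CALLS_PER_QUESTION
print(questions_per_minute)  # 1 -- at most one question per minute, with zero headroom
```

Any extra call (an accidental double request, a second retrieval round) pushes a single question over the limit.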
Oops maybe because I accidentally called the response twice too πŸ˜…
Yes, you're right, I'm currently using index.as_chat_engine()
Does anyone know why the tokens used increase so much on the third attempt? Why is the ~2000-token LLM prompt being sent twice, even though the contents of my document should be the same?
can you send me the code snippet for getting this token usage stats in llama index?
thanks a lot
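In LlamaIndex, these stats come from a TokenCountingHandler registered on the callback manager; as a library-free sketch of what that handler does, here is a minimal stand-in (SimpleTokenCounter and record_llm_call are hypothetical names, and the 4-characters-per-token heuristic replaces the real tokenizer):

```python
# Stand-in sketch of LlamaIndex's TokenCountingHandler: it hooks LLM events
# and accumulates prompt/completion token counts across calls.
class SimpleTokenCounter:
    def __init__(self):
        self.prompt_llm_token_count = 0
        self.completion_llm_token_count = 0

    @staticmethod
    def _approx_tokens(text: str) -> int:
        # Crude heuristic: roughly 4 characters per token for English text.
        return max(1, len(text) // 4)

    def record_llm_call(self, prompt: str, completion: str) -> None:
        self.prompt_llm_token_count += self._approx_tokens(prompt)
        self.completion_llm_token_count += self._approx_tokens(completion)

    @property
    def total_llm_token_count(self) -> int:
        return self.prompt_llm_token_count + self.completion_llm_token_count


counter = SimpleTokenCounter()
counter.record_llm_call("context " * 500, "short answer")  # 4000 + 12 chars
print(counter.total_llm_token_count)  # 1003
```

The real handler exposes the same kind of totals (prompt, completion, and overall LLM token counts), which is what the screenshots in this thread are showing.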
This is the query engine -- it's retrieving two nodes (likely 1024 tokens each) and sending them to the LLM along with the query to answer
Why only the third attempt retrieving two nodes?
It's not a third attempt, actually. The default chat engine (an agent) has the 3 steps I described earlier:

  1. Read user message, either generate a response or write an input to a tool (i.e. the tool in this case is a query engine)
  2. Run the tool with the query (i.e. run the query engine, which retrieves + writes response)
  3. Interpret response in context of previous chat history, give user final answer
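The three steps above can be sketched as a toy loop that counts LLM calls per user turn (FakeLLM and run_agent_turn are illustrative names, not LlamaIndex API):

```python
# Toy simulation of the 3-step agent loop, counting LLM calls per user message.
class FakeLLM:
    def __init__(self):
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return f"response to: {prompt[:20]}"


def run_agent_turn(llm: FakeLLM, user_message: str) -> str:
    # Step 1: read the user message, pick a tool, and write its input.
    tool_input = llm.complete(f"Pick a tool for: {user_message}")
    # Step 2: run the tool -- the query engine itself calls the LLM
    # to synthesize an answer over the retrieved nodes.
    tool_output = llm.complete(f"Answer from index for: {tool_input}")
    # Step 3: interpret the tool output in chat context, give the final answer.
    return llm.complete(f"Final answer given: {tool_output}")


llm = FakeLLM()
run_agent_turn(llm, "Why so many requests?")
print(llm.calls)  # 3 -- three LLM calls for a single user message
```

So one user message consumes the entire 3-requests-per-minute allowance, which is exactly the symptom in the original question.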
Definitely check out other chat modes though
Thank you for the information