Rate limits

It's baffling. It's a complete blocker; I'm going to turn to Langchain indexes only to try to solve it.
Even if I wait, when I try again I still hit the limit, even for a very small number of documents that are each very small
What kind of index are you building? It looks like the LLM is the one hitting rate limits πŸ€”
every index that uses GPT hits the rate limit, no matter how few documents or tokens. I've tried them all, but I want to use the GPTKeywordTable one primarily.
Hmm not sure what to tell ya πŸ€” I've personally not encountered any rate errors, and I haven't seen too many complaints in the discord about it. Some people have indexed entire books πŸ˜…

I agree llama index could probably have a better mechanism to slow down requests though, for situations like this.
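In the meantime, a rough sketch of the usual workaround is to retry with exponential backoff whenever the rate limit hits. This assumes the pre-1.0 openai package and the tenacity library; call_openai here is just a hypothetical stand-in for whatever call is actually being rate limited:

```python
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),  # only retry on rate limits
    wait=wait_exponential(multiplier=1, min=4, max=60),          # back off 4s, 8s, 16s, ... up to 60s
    stop=stop_after_attempt(6),                                  # give up after 6 tries
)
def call_openai(prompt: str) -> str:
    # Hypothetical stand-in for whatever request is being throttled
    response = openai.Completion.create(model="text-davinci-003", prompt=prompt)
    return response["choices"][0]["text"]
```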
I would appreciate a bit more logging as well. I have logging set to debug, but I wish I could see more detail about what's going on
What details are you interested in?
If I get time, I can look at expanding the debug logs. But also any PRs from anyone are extremely welcome πŸ™πŸ™
yes, I've been looking into contributing on GitHub, just posted an issue.

specifically, it would be great to have 'timestamp: request made to OpenAI endpoint' and 'timestamp: response from OpenAI'
and the ability to have that log during index creation and query time
if that already exists and I haven't enabled it, that would be SO helpful
Ah awesome! Yea that doesn't exist yet as far as I know, but would be easy enough to add πŸ™
amazing! yes. right now, whenever I create an index or query, I have no idea how many calls are actually being made.
it's been making me want to just build my own implementation of llama index, which is not the right next step haha
Hahaha yea that might be a little too much work πŸ˜…

One quick option without waiting for a PR is just editing the package yourself.

You can open the installation location on your machine and add whatever print() or logs you want to the source code
Definitely a temp option, but good for debugging
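Something like this, as a rough sketch, to find where the package is installed and then drop a timestamped print into whatever file makes the LLM call (the path in the comment is just an example):

```python
# Locate where llama_index is installed so you can edit the source directly
import os
import llama_index

print(os.path.dirname(llama_index.__file__))
# e.g. something like /usr/local/lib/python3.10/dist-packages/llama_index

# Then, inside the source file that makes the LLM call, a throwaway line like:
from datetime import datetime
print(f"{datetime.now().isoformat()} - sending request to OpenAI")
```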
I work entirely on Colab these days, so it's a little more annoying to do, but yes, that's my next step. I literally can't run anything on llama index right now. Rate limited? Shadow banned? I can't figure it out
appreciate your help!
what's also weird is I tried with the openai library directly, and those queries go through fine...
But those queries would be one at a time, right? A little slower I suppose.

Since you are running on colab, I wouldn't be surprised if OpenAI just sees a million requests from colab and automatically shadow bans them or something silly πŸ˜…

If you are on Windows, using WSL works great with llama index, and removes any Windows-specific complications that come with Python
when I run a single index.query on a keyword table index

how many openai queries will that be? any approximate idea?
but your point about Colab and OpenAI throttling that, that's actually a GREAT POINT.
With default settings, probably about one LLM query per 4000 tokens
ok well, making a single index.query("does this work?") gets me that rate limit error on an index with 5 documents in it, each no more than 100 tokens :/
oh wait, a query! whoops
With a query, it will send one LLM call for every document that has a matching keyword (since the documents are small), plus one call to the LLM to extract the keywords from the query
So roughly 6 max in your case
You can try index.query("my query", response_mode="compact") to stuff as much as it can into one LLM call
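Roughly like this, as a sketch assuming the older llama_index API where the index is built directly from a list of documents:

```python
from llama_index import GPTKeywordTableIndex, SimpleDirectoryReader

# Build the keyword table index over a folder of small documents
documents = SimpleDirectoryReader("data").load_data()
index = GPTKeywordTableIndex(documents)

# "compact" stuffs as much matching text as possible into each LLM call,
# instead of making one call per matching document
response = index.query("does this work?", response_mode="compact")
print(response)
```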
isn't it weird that I'm still getting rate limited from that ?
Very weird! I blame colab at this point lol
kk good to know thanks again.

last question:

I feel like I misunderstand something about llama index.

Once I've built the index, I understand that this takes a bunch of calls and tokens etc.

Then when I run the index.query --
I expected it to do something smart, like sort the index and take the top N best results (all locally),
THEN make an LLM call to summarize the results, then put those into the context + the prompt to get the answer.

How wrong am I? @Logan M I feel like you're one of two people who know the answer
So how the query works depends on the index you are using.

If you want it to work like the operation you described, use GPTSimpleVectorIndex

Then, you can write a query like this: index.query("my query", similarity_top_k=3, similarity_cutoff=0.3)

This retrieves the top 3 most relevant nodes, and removes any with a similarity less than 0.3

If you use default settings, the top k is 1 and the cutoff is 0 (i.e. it uses any node it finds)
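Putting that together, a sketch, again assuming the older llama_index API where the index is built directly from documents:

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex(documents)

# Retrieve the 3 most similar nodes locally, drop anything below 0.3 similarity,
# then call the LLM to synthesize an answer from just those nodes
response = index.query("my query", similarity_top_k=3, similarity_cutoff=0.3)
print(response)
```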
gotcha. and keyword table index is very different?
A keyword index will return any text chunk that has the same keywords as the query. So quite different, yes.

There's a good page on how each index works over here: https://gpt-index.readthedocs.io/en/latest/guides/index_guide.html
I live on that page
I've spent the past month crawling around inside llama index's brain. I'm a data scientist who generally knows what he's doing, and I really struggle with getting llama index + langchain to play nicely
I want to help improve the docs as well, I think I have some ideas for making the overall explanation a little more approachable
just need to get my MVP off the ground, then I can work on improvements!
oh that would be very awesome! πŸ™
will probably make a medium post
once I understand hahaha
The biggest challenge lol