
Updated 2 years ago

Token limit error

Hi. I get the following error when chatting with the agent: openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, you requested 4918 tokens (3894 in the messages, 1024 in the completion). Please reduce the length of the messages or completion.

It still raises this even when the history messages are [].

Here are my settings:
Attachment: image.png
I completely rebuilt all indexes using a new service context, but the issue still exists.
raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4538 tokens. Please reduce the length of the messages.

Back to my original issue: I only get this error when chatting with the agent. I have never gotten it while creating an index or querying an index directly.

my agent creation:
# imports assumed for llama_index 0.5.x / langchain 0.0.142
from llama_index.langchain_helpers.agents import LlamaToolkit
from langchain.agents import initialize_agent
from langchain.memory import ConversationBufferMemory

tools = LlamaToolkit(
    index_configs=index_configs,
    graph_configs=[graph_config],
).get_tools()
memory = ConversationBufferMemory(memory_key="chat_history")
return initialize_agent(
    [] if tools is None else tools,
    llm_predictor_3_5.llm,
    agent="conversational-react-description",
    memory=memory,
    verbose=True,
    agent_kwargs={"prefix": prompt_prefix.format(nickname=bot["nickname"])},
)
Maybe it's because of the tools, or because I'm mixing GPT-3 and GPT-3.5, e.g. saving an index with a GPT-3 service context and loading it with a GPT-3.5 service context. Does that make sense?
Mixing them should be fine; they have the same max input size πŸ€”
Maybe you need to use a memory module that limits how long the memory gets?
Maybe a summary or window memory can help... I'm not as much of an expert on the langchain side

https://python.langchain.com/en/latest/modules/memory/how_to_guides.html
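
For example, a rough sketch of a window memory (assuming langchain ~0.0.142; the memory key and k value are just examples):

from langchain.memory import ConversationBufferWindowMemory

# keep only the last k exchanges so the history can't grow past the context window
memory = ConversationBufferWindowMemory(memory_key="chat_history", k=3)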
I suspected a memory problem too, but look back at my first message in this issue. Even though the memory and agent are newly created on every chat call, it still raises this error, and at that point there are no history messages in memory.
I once had this problem because the history messages piled up in memory, and I fixed it by restricting the messages kept in memory. But that's not the case this time, because I'm pretty sure the error also occurred when the memory was empty.
It feels very strange, because this problem suddenly appeared in the last two days and we had been using it normally for many days before that.
I use this code to count the message tokens before chatting with the agent:
message_tokens = llm_predictor_3_5.llm.get_num_tokens_from_messages(messages=agent.memory.chat_memory.messages)
print(f"before chat message tokens {message_tokens}")
answer = agent_run(agent, question)

It prints: before chat message tokens 130

But the same error is still raised:
openai:error_code=context_length_exceeded error_message="This model's maximum context length is 4097 tokens. However, you requested 4868 tokens (3868 in the messages, 1000 in the completion). Please reduce the length of the messages or completion." error_param=messages error_type=invalid_request_error message='OpenAI API error received' stream_error=False
I was tired of this issue, so I tried switching to GPT-4, since GPT-4 accepts 8k tokens. Unfortunately, my request always seems to be a little bigger than OpenAI's max token limit:

raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 8192 tokens. However, your messages resulted in 9106 tokens. Please reduce the length of the messages.

My max_input_size setting doesn't seem to have any effect.
What if you set max_input_size to be a little smaller than 8k?

That way if it goes over, it will still work

In fact, that might be helpful for the original problem too.
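
A rough sketch of setting a smaller max_input_size through the prompt helper (llama_index 0.5.x-style API; the exact numbers are just examples to leave headroom below the model limit):

from llama_index import LLMPredictor, PromptHelper, ServiceContext
from langchain.chat_models import ChatOpenAI

# leave headroom below the model's 8k (or 4k) context limit
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-4", temperature=0))
prompt_helper = PromptHelper(max_input_size=7000, num_output=1000, max_chunk_overlap=20)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor, prompt_helper=prompt_helper
)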
But when I ran my program with GPT-4, I hadn't changed max_input_size; it was still 4096 as before.
This issue occurred in our production environment, and I thought it could be solved without debugging on my machine. Now I have reproduced it locally: I created 11 indexes from the content of 11 normal web pages (the content is not large), then used those 11 indexes to create an agent for chat the same way as before, and it hit the same error. When I remove any one or two indexes and recreate the agent, chat returns to normal.
So I think it's the index tools that cause this issue.
In our production environment, some users have more documents than I do, and they got this error when chatting with the agent too. A handful of indexes probably doesn't trigger the issue, which is why I hadn't seen it before.
I'm using llama_index v0.5.23.post1, which depends on langchain 0.0.142.
Right, but maybe lowering this will encourage smaller inputs being sent to openai, and hopefully avoid the issue

I think the main problem here is llama index splits based on spaces by default. And so if the content does not have many spaces, it will not chunk data properly

There is supposed to be a fallback, but it seems like its not working.

I'll try to make a fix in the code for this. If it's possible to share a zip of the minimum data to recreate the issue that would be super helpful.
I tried; you can see my settings now.
Attachment: image.png
That's great. I will tidy up these indexes and some Python code snippets, package them up, and send them to you. Thank you.
I also tried building the agent with only index tools, but when I append many indexes it raises this error too.
return [LlamaIndexTool.from_tool_config(index_config) for index_config in index_configs]
Yea, langchain is limited to about 20-30 tools. It depends on how long each name+description is for each tool
Perfect πŸ’ͺ
That's too few. So if I have a lot of indexes, should I only use one graph tool to avoid langchain's limit?
Yea definitely, need to consolidate the indexes somehow. A graph is a good option
In that case, this usage should not be advocated or recommended:
return LlamaToolkit(
    index_configs=index_configs,
    graph_configs=[graph_config],
).get_tools()
I think that's fine though?
Because for me, langchain is only exposed through llama_index, so I didn't know there would be such a limitation.
Most users don't seem to be creating and using more than a handful of indexes at the same time
I think this is more of a special case
That limitation could definitely be noted in the docs though
Hi @Logan M, I've put together a complete reproduction case for you. Just replace your OPENAI_API_KEY at the top of my_test.py, then run: python3 my_test.py "abcd" "hello"
It will raise that error.
The folder "users" contains the index files and a graph file.

This case only slightly exceeds OpenAI's maximum tokens. I want to point out that the more indexes there are, the further the token limit will probably be exceeded.

Please ignore some comments and dead code; they just illustrate what I tried to do. I've removed our business code.
Thank you again for your patient help.
Thanks a ton! I'll do my best to dive into this. Might take a day or two πŸ’ͺπŸ™
So, I'm just working with the graph by itself.

Was there a particular type of query string that was causing problems? Otherwise I'll keep trying a few
Just want to reproduce the token error, then I can step into the code with a debugger to fully inspect the issue πŸ™‚
you could just run this command: python3 my_test.py "abcd" "hello"

The second param "hello" is a simple query string; it will raise the token error too.
ah that did it haha
should have read your first message closer
No, it's me who should say thanks!!
Ok, first issue identified -> the tool descriptions are waaay too long, given how large max_tokens is
Here's the entire input when you ask hello
The entire thing is 3144 tokens
So if your num_output/max_tokens is 1000, 3144 + 1000 is too big
Possible solutions here are lowering max_tokens or figuring out ways to shrink the tool descriptions
I would maybe remove all the summary text from each description, and just keep the tags and titles ?
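
For illustration, a rough sketch of a tool config with a short description (llama_index 0.5.x-style IndexToolConfig; the name, description text, and kwargs are made up for the example):

from llama_index.langchain_helpers.agents import IndexToolConfig

# hypothetical short description: title + tags only, no long summary text
index_config = IndexToolConfig(
    index=index,
    name="Blockey documents",
    description="Documents of user Blockey. Tags: annual meeting, Munger, investing.",
    tool_kwargs={"return_direct": True},
)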
Nice job! You are so professional.
The max_input_size I set is actually 4096-1000=3096. I've always felt this field doesn't take effect: whatever value it has, it can't prevent the final input tokens from ending up larger than it. I thought it would be used to automatically shorten the entire input.

But I agree with you and will take your suggestion to use just tags and titles as the index descriptions. I only made them so long in the hope that the needed tool would be found more accurately.
At the same time, I can see there is always a limit here: every tool's description is submitted to OpenAI, so even if I only use a graph tool, a description that is too long will cause the same issue.

Also, I tried using only the graph tool, and it didn't work as well as appending both the index tools and the graph tool.
When you construct the llama agent, I think if you pass only the graph tool, it will leave out the index tools right?
Right, the max input size is only used by llama index, but not by the actual langchain agent.

Langchain seems to have fewer methods available for controlling sizes
Yes, if I pass only the graph tool, I feel that it doesn't work very well, and sometimes it can't find the corresponding index.
How can I get this entire input before it's sent to OpenAI? And how can I calculate its total tokens?
Since some users have many index files, I had to abandon building the agent with index tools + a graph tool and use only a graph tool instead.

Then a new problem occurred. Here is the verbose log of a chat call:

Entering new AgentExecutor chain...
Thought: Do I need to use a tool? Yes
Action: document knowledge library of user "Blockey"
Action Input: "What did Munger say in the annual meeting?"
>[Level 0] Current response: ANSWER: None of the choices are directly related to what Munger said in the annual meeting.
>[Level 0] Could not retrieve response - no numbers present

Observation: None


Finished chain.

This chat returned "None", which is not a human-sounding answer. Does that mean it tried to use a tool to answer my question, but the tool had no information about it, so it returned None? Was this caused by some error?
I want a more human answer, such as "There's no information about…"
I had to set a breakpoint in the langchain code haha then I could print the prompt before it's sent to openai. You can count tokens using this: https://platform.openai.com/tokenizer
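
For reference, a rough way to count tokens in code instead of the web tokenizer (assumes the tiktoken package is installed; prompt_text stands for whatever string you captured at the breakpoint):

import tiktoken

# token count for a prompt string using the gpt-3.5-turbo encoding
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
print(len(enc.encode(prompt_text)))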
Hmmm yea something weird happened there and the LLM wasn't answering. Maybe you can check for None response and replace it πŸ€”
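
A minimal sketch of such a guard around the existing agent_run call (the fallback wording is just an example):

answer = agent_run(agent, question)
# hypothetical fallback: replace an empty or "None" answer with a friendlier message
if answer is None or str(answer).strip().lower() in ("", "none"):
    answer = "Sorry, there's no information about that in your documents."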
Yea it depends on the top level index. Some are better than others. Maybe a vector index on the top level would be better? Then you can also set the top k, which is how many sub indexes it will check
Oh, I didn't know a vector index could be the top level. So I could do it like this?

graph = ComposableGraph.from_indices(
    GPTSimpleVectorIndex,
    [index1, index2, index3],
    index_summaries=[index1_summary, index2_summary, index3_summary],
)
Good idea! If you don't think there's any potential big problem here, I can simply do it as you said.
Yea that should be all you need

If you want to set a config specifically for the top level vector index, and your graph has multiple vector indexes, you can set/use the ID of the top level index and specify that in your query configs
(In newer versions this has changed btw. But I think you are still on 0.5.x)
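
For example, a rough sketch of 0.5.x-style query configs for the top-level vector index (the exact keys are recalled from the old docs, so treat them as an assumption):

# "simple_dict" is the struct type for GPTSimpleVectorIndex in 0.5.x
query_configs = [
    {
        "index_struct_type": "simple_dict",
        "query_mode": "default",
        "query_kwargs": {"similarity_top_k": 3},
    },
]
response = graph.query("What did Munger say in the annual meeting?", query_configs=query_configs)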
It seems I've seen this page before; I will study it again.
I was considering upgrading llama_index, but the upgrade was delayed because of the previous token problem.
I'm also a little worried about whether the upgrade will require big changes to our code.
I was upgrading very frequently until some of the usage changed in version 0.6.
Yea v0.6.X will change some code for sure πŸ˜…
I just realized: the id here is for this purpose, when the same index type is used at multiple levels of the graph.
Yup, that's why πŸ’ͺ
Hi @Logan M. I'm upgrading llama_index in my project from 0.5.7 to 0.6.7. I found that index storage is now a directory, whereas previously it was just a JSON file. In my case, as you saw in the data in the zip I sent you, each index has a different topic. So I think for v0.6+ I should store each index in a different directory and load them separately next time.
Do you think this is the right way? Should documents with related topics be appended to one index (GPTVectorStoreIndex), and otherwise a new one created?
You can definitely create a save dir per index, yea. There are also features like the mongo doc store to share nodes between indexes, but as you said, each index may have its own information.

Eventually you will be able to save all indexes in one dir, but there's a bug with that, which we found the other day, that needs to be fixed first
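
A rough sketch of one-directory-per-index persistence in 0.6.x (the paths are just examples):

from llama_index import StorageContext, load_index_from_storage

# persist each topic's index to its own directory
index.storage_context.persist(persist_dir="users/blockey/munger_docs")

# later: load that one index back from its directory
storage_context = StorageContext.from_defaults(persist_dir="users/blockey/munger_docs")
index = load_index_from_storage(storage_context)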
OK. I haven't gotten that far yet.
I want to control index isolation and permissions through the storage directory structure, so I'd still prefer to store each index under its own path.
Perhaps you have also considered the flexibility to control this, but I don't know much about the new version yet. I still need to take my time.
Hi @Logan M, I'm not sure whether there is a bug here. After I set a summary on a GPTVectorStoreIndex using index.index_struct.summary = summary,
the next time I load it the summary is null, and it is also null in the file index_store.json.
As you know, I rely on the summary field in my case.
Yea I think the summary is only saved when constructing a graph πŸ€” Definitely could be added to the normal index save though
In the old 0.5.* versions it could be saved and restored for a simple index, but it didn't work for a graph.
I need the summary to be saved and restored for both indexes and graphs. Can you update this?
I set it and then saved it, so I should be able to restore it; do you think that's right?
For the graph it should be saving the summary? πŸ‘€

But yea, it makes sense that it should be saving that
Yes, I thought the user could decide what it is stored for, unless you keep this property for another purpose.
OK, maybe for the graph it's not needed.
I just think that every time I build a tool, I need to get all the indexes, iterate through them, and then reassemble a graph tool, which is a bit cumbersome.
I originally intended to save the tools that had already been created, but that may not work for now.
At least for the index summary, I'm relying on it now, because I haven't stored this information in the database.
Yea for sure, that makes sense. I took a quick peek but I don't fully understand why it's not saving the summary actually lol
Will take a bit more
It's okay. When you've fixed it, you can let me know.
Hi @Logan M, if an index is created using the OpenAI LLM and stored to a dir, can that dir be loaded with another LLM such as Claude-v1.3-100k or a HuggingFace model?
Or can index files generated by one LLM only be loaded by a StorageContext for the same LLM?
You can change the LLM, but if you change the embed model, you will have to create the index again πŸ‘
I mean, if I change the LLM, can I use load_index_from_storage to load a storage dir that was created with a different LLM?
Does that work?
I haven't considered changing the embed model; I don't know whether that will be needed.
That's amazing. Maybe when you refactored v0.6 you already considered giving users the ability to switch LLMs while keeping the same storage.
Yea this should work! When you do load from storage, just pass in the service context that has the new LLM predictor setup πŸ’ͺ
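
A rough sketch of that (0.6.x-style API; the model choice and path are just examples):

from llama_index import LLMPredictor, ServiceContext, StorageContext, load_index_from_storage
from langchain.chat_models import ChatOpenAI

# reload an index persisted earlier, but query it with a different LLM
storage_context = StorageContext.from_defaults(persist_dir="users/blockey/munger_docs")
new_service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=ChatOpenAI(model_name="gpt-4", temperature=0))
)
index = load_index_from_storage(storage_context, service_context=new_service_context)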
Actually, I don't know how to change the embedding model πŸ˜…
Haha no worries! When people change it, it's usually to use a local model from huggingface. But tbh the default openai model works well, and it's very cheap
This should be noted in the storage docs.
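
For reference, a rough sketch of plugging in a local HuggingFace embedding model (the model name is just an example; note that changing the embed model means rebuilding existing indexes):

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext

# wrap a local sentence-transformers model as the embed model
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)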