Hi! When using query_engine alone, I can control the size of the data so it doesn't exceed the context window:
Plain Text
chatmemory = ChatMemoryBuffer.from_defaults(token_limit=(history_limit + context_limit))
query_engine = index.as_chat_engine(
    chat_mode='condense_plus_context',
    similarity_top_k=similarity_top_k,
    llm=llm_engine,
    system_prompt=prepared_system_prompt,
    memory=chatmemory,
)

When using an agent, I'm trying to do the same:
Plain Text
agent = OpenAIAgent.from_tools(
    tools,
    llm=self.llm_engine,
    verbose=True,
    system_prompt=self.system_prompt,
    memory_cls=chatmemory,  # <= token_limit: 14385
)

But some tools' output is still too big, and I get the exception: "This model's maximum context length is 16385 tokens. However, you requested 17561 tokens (15405 in the messages, 156 in the functions, and 2000 in the completion)". Why is that, and how can I fix it? Thanks!
There are no assumptions made about the size of tool outputs

Maybe using a load-and-search tool would help, to wrap tools that return large outputs
An additional question: in "15405 in the messages, 156 in the functions", what are "messages" and "functions" here?
Sorry, can you please elaborate on "using a load-and-search tool "?
The OpenAI API uses messages and functions/tool calls. Both count towards the overall token count.
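In other words, the three parts of the error message simply add up past the model's 16385-token window (the 2000 "in the completion" is presumably the space reserved for the response, i.e. max_tokens):
Plain Text
15405 (messages) + 156 (functions) + 2000 (completion) = 17561 > 16385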
Basically, you can wrap any tool with a load and search tool
https://docs.llamaindex.ai/en/stable/module_guides/deploying/agents/tools/llamahub_tools_guide/#loadandsearchtoolspec

It will create an index on the fly if your tool returns long outputs
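For reference, a minimal sketch of what that wrapping might look like, based on the linked docs. Here my_tool stands in for whichever of your tools returns large outputs, llm_engine is the LLM from your own snippet, and the import paths assume a recent llama_index layout (they may differ by version):
Plain Text
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.tools.tool_spec.load_and_search import LoadAndSearchToolSpec

# my_tool is whichever existing tool blows up the context window with its output
wrapped_tools = LoadAndSearchToolSpec.from_defaults(my_tool).to_tool_list()

# The wrapper exposes two tools: a "load" tool that runs the original tool and
# stores its output in an index, and a "read" tool that does a top-k search over it.
agent = OpenAIAgent.from_tools(wrapped_tools, llm=llm_engine, verbose=True)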
Or you could implement this yourself in the definition of your tool (if it's a function tool)
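A rough sketch of the do-it-yourself version, assuming a hypothetical fetch_big_document helper that returns a long string; exact import paths again depend on your llama_index version:
Plain Text
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.tools import FunctionTool

def fetch_and_search(url: str, query: str) -> str:
    """Fetch a large document and return only the parts relevant to the query."""
    raw_text = fetch_big_document(url)  # hypothetical helper returning long text
    # Index the raw output on the fly ...
    index = VectorStoreIndex.from_documents([Document(text=raw_text)])
    # ... and hand the agent a top-k answer instead of the full text,
    # so the tool output stays within a bounded token budget.
    return str(index.as_query_engine(similarity_top_k=2).query(query))

tool = FunctionTool.from_defaults(fn=fetch_and_search)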
Thank you! Let me look into it.
You know what? It worked great for me! But I don't understand why wrapping with this load-and-search class solved the problem. Where can I learn more about it?
Probably reading the source code is the best spot 🙂
(click the first source code dropdown)
https://docs.llamaindex.ai/en/stable/api_reference/tools/load_and_search/
Ah, nice! Thank you!
Basically it creates a load tool and a read tool (which creates an index and a query engine under the hood)
Yeah, I got that part, but does that query engine check the context window itself? Because when I don't use tools and use the chat engine, I have to pass the calculated chat memory, otherwise I could get the same exception.
Since it's using a query engine, the context window is handled. It takes all the tool output and does a top-k search on top of it