Hey all, I'm having a heck of a time

Hey all, I'm having a heck of a time managing token limits for GPT-3.5 Turbo in llama-index 0.10.7 (I keep overrunning). I'm using the Chat Engine, and after diving into the code I've implemented some token counters to monitor usage. Even so, in my most recent chat my count of the "all messages" variable (which I think is what eventually gets sent) came to 15875 tokens, and then I got an overrun error from OpenAI saying their limit is 16385 and my total was 16443 (I think that includes their response). Auditing what's being sent, I can see a huge amount of chat history going out along with my RAG data, system prompt, etc. Sometimes it truncates things to fit under the limit, but other times it doesn't seem to; sometimes it gives me the "initial token limit" error, and sometimes it goes through to OpenAI and errors there. Is there a better, more reliable way to make sure the token limit doesn't overrun, as opposed to just using smaller/fewer chunks?
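For reference, the counting I'm doing is roughly this (a sketch; it assumes tiktoken with the gpt-3.5-turbo encoding, and all_messages is the list of chat messages I pull out of the engine further down):

Plain Text
import tiktoken

# Rough pre-flight count of what will be sent. This is an approximation: it
# ignores the few extra tokens OpenAI adds per message for roles/formatting.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt_tokens = sum(len(enc.encode(m.content or "")) for m in all_messages)
print(f"Approximate prompt tokens: {prompt_tokens}")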
16 comments
What kind of chat engine are you using? Sounds like a context one?
Shouldn't the ChatMemoryBuffer work for this?
If the number of chunks retrieved is too big, nope
(If it's a context chat engine, that is)
Yeah, I just meant in terms of accumulating too much history, in addition to reducing chunk size/amount
@Darthus can you share some code? πŸ™‚
(I don't want to keep speculating haha)
Thanks Logan (and @Teemu), here's how I'm defining my chat engine:

Plain Text
chat_engine = index.as_chat_engine(
        chat_mode="context", 
        retriever_mode="embedding", 
        similarity_top_k=5,
        node_postprocessors=[reranker],
        verbose=True,
        system_prompt=" ".join((DM_Prompt, History_Prompt, Character_Prompt)),
        memory=memory
    )


and I'm setting the settings like this:

Plain Text
    Settings.llm=llm
    Settings.chunk_size=512
    Settings.callback_manager=CallbackManager([token_counter])
    Settings.chunk_overlap=25
    Settings.embed_model=embed_model
    Settings.num_output=512


I increased num_output from the default of 256 because that seemed to be what caused it to overrun before (the prompt was just under the limit for GPT-3.5, and the output pushed it over).
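For completeness, token_counter and memory above are basically a TokenCountingHandler and a ChatMemoryBuffer, something along these lines (a sketch; the exact tokenizer wiring here is illustrative, and I'm leaving out the reranker, which is just a node postprocessor):

Plain Text
import tiktoken
from llama_index.core.callbacks import TokenCountingHandler
from llama_index.core.memory import ChatMemoryBuffer

# Count tokens with the same encoding GPT-3.5 Turbo uses
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

# Chat history buffer; at this point I was still relying on the default token_limit
memory = ChatMemoryBuffer.from_defaults()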

I am seeing messages like this sometimes:

Plain Text
Query has been truncated from the right to 256 tokens from 1271 tokens.
So it does seem to be doing some clipping to keep things under the limit, but I'm not sure exactly what gets clipped.
It seems like chat_engine doesn't have the same ability to "get_prompts" to analyze the prompts and token counts before sending, so I'm using this code I cobbled together from looking at the chat_engine code:

Plain Text
# Re-create what the context chat engine will send, so I can count tokens before the call.
# (This runs inside an async function and pokes at private chat_engine internals.)
context_str_template, nodes = await chat_engine._agenerate_context(message)
prefix_messages = chat_engine._get_prefix_messages_with_context(context_str_template)
initial_token_count = len(
    chat_engine._memory.tokenizer_fn(
        " ".join([(m.content or "") for m in prefix_messages])
    )
)
all_messages = prefix_messages + chat_engine._memory.get(
    initial_token_count=initial_token_count
)
all_messages_token_count = len(
    chat_engine._memory.tokenizer_fn(
        " ".join([(m.content or "") for m in all_messages])
    )
)

print(f"Context String: {context_str_template}")
print(f"Prefix messages: {prefix_messages}")
# print(f"Nodes: {nodes}")
print(f"Initial Token Count: {initial_token_count}")
print(f"All Messages: {all_messages}")
print(f"All Message Token Count: {all_messages_token_count}")
Since increasing num_output I haven't had as many overruns, but I'm just not sure I'm using the best techniques, and I'm not sure what's being trimmed/cut between the chat history, my system prompt (which is long, and I keep jamming more stuff into it), and the RAG chunks.
It seems like chat_engine doesn't have the same ability to "get_prompts" to analyze the prompts and token counts before sending
I usually just hook it up with an observability platform and read the final prompt there, on a web UI.

The platform of my choice is Arize Phoenix, because it's fully local.
Integrating with it takes 4 lines of code:
https://github.com/tslmy/agent/blob/e330255806c97a93a733dab3edd9a843902375f5/main.py#L23-L32
Plain Text
import phoenix as px
px.launch_app()
from llama_index.core import set_global_handler
set_global_handler("arize_phoenix")

It looks like this on the right:
[Attachment: 300367903-1ce09f03-1ff5-4e51-bed8-77d281ddad41.png]
Thanks for this tip. Yeah, I tried to hook into WandB, but it focused more on things like LLM processing time rather than auditing the raw prompt content. I'll look into it.
Btw @Logan M and @Teemu, I'm still running into this issue, even after expanding the output context size to 512. This time I had about 20-30 turns, and I can see that the chat history in my latest request goes quite a bit back, and then after a while I finally hit "'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 16445 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'". It looks like the message size was slowly creeping up as the chat history grew. I just can't zero in on the logic in LlamaIndex where the request/output length is managed to stay under the context length. Some logic for cropping the system prompt, chat history, RAG results, and the actual user message clearly exists (I see it cropping sometimes), but I seem to be overrunning it somehow and I'm not sure where. If I knew, for example, that the system prompt size was not included in LlamaIndex's decision of how to crop the history to fit in the context window, I could calculate the max size myself and make sure I stay under it.
I'm concerned about deploying this for people to use and then having it start overrunning the context window around the 40th request, at which point it basically needs to be reset to recover.
@Darthus try lowering the token limit on the memory

Plain Text
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

index.as_chat_engine(..., memory=memory)
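If you want to sanity-check what actually survives that limit before sending, something along these lines should work (a sketch; prefix_text stands in for whatever system prompt + retrieved context you end up building):

Plain Text
# Tokens taken up by the fixed prefix (system prompt + retrieved context)
initial_tokens = len(memory.tokenizer_fn(prefix_text))

# The buffer returns only as much recent history as fits alongside that prefix
history = memory.get(initial_token_count=initial_tokens)
history_tokens = len(memory.tokenizer_fn(" ".join(m.content or "" for m in history)))

print(f"History messages kept: {len(history)}, history tokens: {history_tokens}")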
Thanks. So, in order to calculate the size of the various parts of the request/response, it'd be:

1) System prompt
2) User message
3) Chat History (managed through memory token_limit)
4) RAG results (managed through chunk_size * similarity_top_k)
5) LLM Output (managed through Settings.num_output)

Am I missing anything, and do you have any tips on monitoring/setting limits on these components? The idea being that if I can set and monitor a limit for each of these, and make sure that added together they don't go past the 16385-token limit, I should be good. I don't know if there are existing limits on user message size, and I also think my issue might be the system prompt: I'm adding contextual information there for the LLM to keep track of over time, but I get the sense the other sizes aren't flexing to accommodate that, and I don't have a sense of what limit I should be setting there for myself.
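To make that concrete, here's the back-of-the-envelope budget I'm thinking of enforcing, as a sketch (the system prompt and user message caps are just numbers I'm picking, and OpenAI adds a little per-message overhead, so some headroom is needed):

Plain Text
# Target: stay under GPT-3.5 Turbo's 16385-token context window
CONTEXT_WINDOW = 16385

system_prompt_budget = 2000   # my own cap for the DM/History/Character prompts
user_message_budget = 500     # truncate or reject longer user messages myself
rag_budget = 512 * 5          # chunk_size * similarity_top_k
history_budget = 3000         # ChatMemoryBuffer token_limit
output_budget = 512           # Settings.num_output

total = (system_prompt_budget + user_message_budget + rag_budget
         + history_budget + output_budget)

print(f"Budgeted prompt + output tokens: {total} / {CONTEXT_WINDOW}")
assert total < CONTEXT_WINDOW, "Budget exceeds the model's context window"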