Updated 2 years ago

Please help! A very big problem!

At a glance
The community member keeps hitting a context-length exception when setting similarity_top_k=5 for the as_chat_engine function. They implemented a custom node postprocessor to drop nodes once the running token count gets too high, but it doesn't help. They are using the gpt-3.5-turbo model with a chunk size of 1024. They tried various approaches, including decreasing the top_k value, checking metadata, and printing node sizes, but the issue persists. They cannot reproduce the issue locally, yet it occurs when deployed to AWS, and they are unsure how to correctly count the tokens and confirm the returned nodes are actually being used. In the thread, the excess tokens turn out to come from the chat history; the suggested fix is to budget tokens explicitly with a memory token limit and max_tokens.
Useful resources
Please help! A very big problem!
When setting similarity_top_k=5 for the as_chat_engine function, I continuously hit the exception on exceeding the token limit. I tried to implement a custom node postprocessor to exclude a node as soon as the number of tokens gets too high, but it doesn't help at all:

Plain Text
from typing import List, Optional

from llama_index import QueryBundle
from llama_index.schema import NodeWithScore


class EnumCustomPostprocessor:

    query_len = 0

    def __init__(self, query):  # Here, the query is the prompt
        # Reserve the prompt's tokens plus a small safety margin
        self.query_len = (num_tokens_from_string(query) if query is not None else 0) + 100  # Just in case

    def postprocess_nodes(
        self, nodes: List[NodeWithScore], query_bundle: Optional[QueryBundle]
    ) -> List[NodeWithScore]:
        # Keep nodes until the running total (prompt + nodes) would exceed the budget
        whole_tokens = self.query_len
        final_nodes = []
        for n in nodes:
            # num_tokens_from_string is a tiktoken-based counter (sketched below)
            tokens = num_tokens_from_string(n.node.text)
            whole_tokens += tokens
            if whole_tokens > 2800:
                break
            else:
                final_nodes.append(n)

        return final_nodes
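
For reference, num_tokens_from_string is my tiktoken-based counter, more or less this (the exact model name passed in may differ):

Plain Text
import tiktoken

def num_tokens_from_string(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens the way the target OpenAI model tokenizes text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))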


The size of chunks is 1024. This is where I use it:

Plain Text
query_engine = index.as_chat_engine(
    verbose=True,
    chat_mode="context",
    similarity_top_k=5,
    system_prompt=prepared_system_prompt,
    node_postprocessors=[EnumCustomPostprocessor(query_text + prepared_system_prompt)],
)

Despite all my tricks it's not working and it throws the exception anyway. Please help, this is so critical for me. similarity_top_k should stay at 5.
This approach is apparently not working, but I have no idea how else I can control the number of tokens. I thought it was already implemented in the library, but @Logan M told me it's not. 😦
Thanks!
78 comments
@SeaCat Which LLM are you using? I just tried running the chat engine with top k 5 and 1024 chunk sizes, worked fine for me. If you could send more of your code I can try helping but as a quick fix have you considered using gpt-3.5-turbo-16k?
With 16k I was able to comfortably do top k 10
It's gpt-3.5-turbo.
Are you adding a lot of metadata?
Hmmmm probably but honestly, I don't know. Where should I look at?
node.metadata 🙂

Also, I wonder if you are counting tokens correctly? That post-processor should work (assuming you have a newer version of llama-index)
Ah, okay, I have to check. I pass some metadata but I don't think it's too much, but who knows. I have to check. And yes, updated recently :)
Yeah had the same thought about the counting
Have you considered swapping to 16k?
Hmmm I don't know... probably not, right now I'd like to solve the problem for the current LLM. I've already decreased top_k, temporarily to 3 and would like to find a fix. Let me check the metadata first!
Metadata is all empty. The craziest thing is I can't repro it locally; it only breaks when deployed... have no idea why... UPD: I forgot to change the parameter, maybe with it the metadata is different. I have to check again 🤦‍♀️
Update. Checked again, but the metadata is empty. The only additional information is extra_info for documents, but it can't be much (just URLs of sources) and I don't even know if this information participates in the request. The most awful thing is that I can't repro it on my local computer, but as soon as I deploy to the server (AWS) it starts throwing exceptions with the same data. It drives me nuts. I have no idea which variables to log to see what's going on.
I would try printing the node sizes in your post-processor to confirm that a) the post-processor is actually being used and b) the postprocessor is working
Okay, thanks, let me try to do it
Should it be n.node.text length? (n is node in nodes)
the actual length to measure should be

Plain Text
from llama_index.schema import MetadataMode

n.node.get_content(metadata_mode=MetadataMode.LLM)
This is what the LLM will end up seeing
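Something like this inside your postprocess_nodes loop (names match your snippet above, adjust as needed):

Plain Text
for n in nodes:
    # Length/tokens of what will actually be sent to the LLM, metadata included
    content = n.node.get_content(metadata_mode=MetadataMode.LLM)
    print(len(content), num_tokens_from_string(content))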
Ahh I see, let me try to use it
It seems to me I know what the heck is that
Even before I checked if the postprocessor exists I knew I forgot to update the requirements.txt 🤦‍♀️ 🤦‍♀️ 🤦‍♀️ 🤦‍♀️ 🤦‍♀️
Did you get it resolved?
Not yet. But at least I know the postprocessor is called. What I see is that, in my case, when there's an exception the node lengths are:

3775, 2804, 3785. I only included 3 nodes, but it still throws the exception: "openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4313 tokens. Please reduce the length of the messages."
I need to count the tokens too I guess
Yup, that's my thought
Looks like it's not doing it currently
Hmm what do you mean?
Counting the tokens correctly
Ahh okay. Yeah, maybe, I don't know. I need to log the token counts - I count them with tiktoken
Do you have your chunk size set to 1024?
Yes, the chunk size is 1024 and I can't change it (because the nodes are already in the database). What I see is that the total token count in the postprocessor is 2655. Then the question is: where the heck are the other 2,000 tokens from???
can you print/share a sample of the node content that's longer than 1024 tokens?
do you mean the content of node?
lol words are hard
Sure:
(Discord doesn't allow me to publish such long text here :))), so it's just a pic:
Attachment: image.png
that looks much shorter than 3700 tokens tbh

Can you share the function you are using to count tokens?
It's around 1k tokens I think?
3700 is the text length, not tokens.
Yeah thats the character count
Rough estimate
Attachment: image.png
it's one of 3 nodes; in total they are 2655 tokens
So, according to the calculations, I return 3 nodes, 2655 tokens in total. Now the question is: how can I make sure these nodes are really the ones in use, not the original ones?
Because I have a feeling that even though the postprocessor is called, the returned nodes are not being used 🤔 otherwise I can't explain the extra 2,000 tokens. I have an idea
No, it's working okay and response.source_nodes has exactly the number of nodes in use. Does the response object have information about the request sent to OpenAI, or how can I obtain this information somehow?
easiest option is turning on debug

Plain Text
import openai
openai.log = "debug"
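If you'd rather get the counts from llama-index itself, there's also a token counting callback. A rough sketch (assuming a recent llama-index version; the service_context has to be the one your index actually uses):

Plain Text
import tiktoken
from llama_index import ServiceContext
from llama_index.callbacks import CallbackManager, TokenCountingHandler

# Counts tokens for every LLM call made through this service_context
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)
service_context = ServiceContext.from_defaults(
    callback_manager=CallbackManager([token_counter])
)

# ... build/load the index with this service_context, run the chat engine ...
print(token_counter.prompt_llm_token_count, token_counter.completion_llm_token_count)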
Thanks, let me try it
Oh, this is crazy. I can't explain it. Locally, I have "prompt_tokens": 2700, "completion_tokens": 242, which is great, so why are the numbers different when it's not running on my local machine? 😫
Okay, I'm going to deploy this code to see what's going on on the server, else I won't be able to know🫑
Damn. When it throws the exception, it actually doesn't add any information, only this, which is pretty useless:

Plain Text
Sep  6 18:59:52 ip-172-31-5-146 web: body='{\n  "error": {\n    "message": "This model\'s maximum context length is 4097 tokens. However, your messages resulted in 4246 tokens. Please reduce the length of the messages.",\n    "type": "invalid_request_error",\n    "param": "messages",\n    "code": "context_length_exceeded"\n  }\n}\n' headers='{\'Date\': \'Wed, 06 Sep 2023 18:57:04 GMT\', \'Content-Type\': \'application/json\', \'Content-Length\': \'281\', \'Connection\': \'keep-alive\', \'access-control-allow-origin\': \'*\', \'openai-organization\': \'user-\gdz\', \'openai-processing-ms\': \'50\', \'openai-version\': \'2020-10-01\', \'strict-transport-security\': \'max-age=150; includeSubDomains\', \'x-ratelimit-limit-requests\': \'3500\', \'x-ratelimit-limit-tokens\': \'90000\', \'x-ratelimit-remaining-requests\': \'3499\', \'x-ratelimit-remaining-tokens\': \'85423\', \'x-ratelimit-reset-requests\': \'17ms\', \'x-ratelimit-reset-tokens\': \'3.05s\', \'x-request-id\': \'f663bc\', \'CF-Cache-Status\': \'DYNAMIC\', \'Server\': \', \'CF-RAY\': \'80YYZ\', \'alt-svc\': \'h3=":443"; ma=86400\'}' message='API response body'
And it's super inconsistent. Sometimes it throws the exception, sometimes (with the same query) it doesn't.
When I include the chat history, is it part of the completion?
Ah, it sure is
You can change the length of the memory, by default it cuts off at 1500 tokens
Or something like that
Ohhh, I see now what's going on. The previous responses were very long too. By default, I add the 5 latest user-system pairs plus one specific message. But can you please explain how I can control how many messages to include in the history, other than manually counting tokens?
It's probably better to use a token limit like Logan mentioned
Thanks! But I have no clue how to use this token limit
should I do it with PromptHelper or some other objects/functions?
Plain Text
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
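And then you pass the memory into the chat engine, something like this (reusing the names from your snippet above):

Plain Text
chat_engine = index.as_chat_engine(
    chat_mode="context",
    similarity_top_k=5,
    memory=memory,
    system_prompt=prepared_system_prompt,
)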
It's better to keep it dynamic like that, appending a certain fixed amount of messages doesn't work as well. With a set fixed amount you tend to run into issues
Oh, thanks, I see now. But how can I keep it dynamic? token_limit=1500 doesn't look very dynamic
I just meant it's more dynamic than a fixed amount of messages. Let's say all 5 messages are usually 500 tokens. What if the LLM gives a very short response? Now you're essentially losing out on context compared to a token limit
Yeah, I see. Now I'm struggling to find the description of this parameter and what it should be. The documentation has everything, but the search just sucks; I was never able to find anything there
What is 1500? Maybe it should be 500? Not sure how to use it
I think the default value is 3000 tokens, 1500 would be half of that
Just a way to control token usage, lower you put it; the less you'll use for the memory functionality. Depends on how much of the conversation you want to track
Also the lower you put the value, the more space you'll have for retrieving context. That means it could help you solve your current issue
Yeah, I think I have to allocate specific numbers for: history, context, response. The last part can't be predicted, or can it? I found that max_tokens can limit it, but I'm not sure if it can be used in the chat engine
Correct. You can configure each one. For max_tokens you can pass it like this:

Plain Text
llm=OpenAI(model="gpt-3.5-turbo-16k", temperature=0, max_tokens=1000)
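Putting the pieces together, a rough sketch of budgeting all three parts while staying on gpt-3.5-turbo (import paths assume a recent llama-index release; adjust the numbers to your data):

Plain Text
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.memory import ChatMemoryBuffer

# Rough budget for the 4097-token window:
# ~1000 tokens of chat history, ~1000 for the response, the rest for retrieved context.
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=1000)
memory = ChatMemoryBuffer.from_defaults(token_limit=1000)

# The llm goes into the service_context used when building/loading the index,
# e.g. load_index_from_storage(storage_context, service_context=service_context)
service_context = ServiceContext.from_defaults(llm=llm)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    similarity_top_k=5,
    memory=memory,
    system_prompt=prepared_system_prompt,
)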
Cool, thank you! Eventually, I can limit all the parts dynamically
Yup! 💪
Hopefully you can get the issue resolved now, if you have more questions just ask
Yeah, absolutely! Thank you guys @Teemu and @Logan M for your help!! (I'm implementing the logic now 🙂)
No worries! Happy to help! ❀️