@SeaCat Which LLM are you using? I just tried running the chat engine with top k 5 and 1024 chunk sizes, worked fine for me. If you could send more of your code I can try helping but as a quick fix have you considered using gpt-3.5-turbo-16k?
With 16k I was able to comfortably do top k 10
Are you adding a lot of metadata?
Hmmm, probably, but honestly I don't know. Where should I look?
node.metadata
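For example, something like this against the retrieved nodes would show it (just a quick sketch; index and the query string are placeholders for whatever you have):

retriever = index.as_retriever(similarity_top_k=5)
for n in retriever.retrieve("some test query"):
    # metadata gets prepended to the node text before it goes to the LLM
    print(n.node.metadata)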
Also, I wonder if you are counting tokens correctly? That post-processor should work (assuming you have a newer version of llama-index)
Ah, okay, I have to check. I pass some metadata but I don't think it's too much, but who knows. And yes, updated recently :)
Yeah had the same thought about the counting
Have you considered swapping to 16k?
Hmmm I don't know... probably not, right now I'd like to solve the problem for the current LLM. I've already decreased top_k, temporarily to 3 and would like to find a fix. Let me check the metadata first!
Metadata is all empty. The craziest thing is I can't repro it locally; it only breaks when deployed... I have no idea why... UPD: I forgot to change the parameter, maybe with it the metadata is different. I have to check again 🤦
Update. Checked again, but the metadata is empty. The only additional information is extra_info for documents, but it can't be much (just URLs of sources) and I don't even know if this information participates in the request. The most awful thing is that I can't repro it on my local computer, but as soon as I deploy to the server (AWS) it starts throwing exceptions with the same data. It drives me nuts. I have no idea which variables to log to see what's going on.
I would try printing the node sizes in your post-processor to confirm that a) the post-processor is actually being used and b) the postprocessor is working
Okay, thanks, let me try to do it
Should it be the length of n.node.text? (n is a node in nodes)
the actual length to measure should be
from llama_index.schema import MetadataMode
n.node.get_content(metadata_mode=MetadataMode.LLM)
This is what the LLM will end up seeing
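Roughly like this inside your postprocessor, before you return the nodes (a sketch, assuming it gets the usual list of NodeWithScore objects):

from llama_index.schema import MetadataMode

for n in nodes:
    # length of exactly what the LLM will end up seeing (node text + metadata)
    print(len(n.node.get_content(metadata_mode=MetadataMode.LLM)))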
Ahh I see, let me try to use it
It seems to me I know what the heck that is
Even before I checked whether the postprocessor exists, I knew I had forgotten to update the requirements.txt 🤦 🤦 🤦 🤦 🤦
Not yet. But at least I know the postprocessor is called. What I see is that in my case, when there is an exception, the node lengths are:
3775, 2804, 3785. I also included 3 nodes, but it still throws the exception: "openai.error.InvalidRequestError: This model's maximum context length is 4097 tokens. However, your messages resulted in 4313 tokens. Please reduce the length of the messages."
I need to count the tokens too I guess
Looks like it's not doing it currently
Counting the tokens correctly
Ahh okay. Yeah, maybe, I don't know. I need to log the number of tokens - I count them with tiktoken
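If it helps, a minimal tiktoken sketch for logging per-node token counts (assuming gpt-3.5-turbo and the same get_content call from above):

import tiktoken
from llama_index.schema import MetadataMode

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
for n in nodes:
    text = n.node.get_content(metadata_mode=MetadataMode.LLM)
    # character count vs. the token count the model actually sees
    print(len(text), "chars,", len(enc.encode(text)), "tokens")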
Do you have your chunk size set to 1024?
Yes, the chunk size is 1024 and I can't change it (because the nodes are already in the database). What I see is that the total token count in the postprocessor is 2655. Then the question is: where the heck do the other 2,000 tokens come from???
can you print/share a sample of the node content that is longer than 1024 tokens?
do you mean the content of the node?
Sure:
(Discord doesn't allow me to publish such a long text here :))), so it's just a pic:
that looks much shorter than 3700 tokens tbh
Can you share the function you are using to count tokens?
It's around 1k tokens I think?
3700 is the text length, not tokens
Yeah, that's the character count
it's one of 3 nodes; in total they are 2655 tokens
So, according to the calculations, I return 3 nodes, 2655 tokens in total. Now the question is: how can I make sure these nodes are really in use, and not the original ones?
Because I have a feeling that even though the postprocessor is called, the returned nodes are not being used, otherwise I can't explain the extra 2,000 tokens. I have an idea
No, it's working okay and response.source_nodes has exactly the number of nodes in use. Does the response object have the information about the request sent to OpenAI, or how can I obtain this information somehow?
easiest option is turning on debug
import openai
openai.log = "debug"
Oh, this is crazy. I can't explain it. Locally, I have "prompt_tokens": 2700, "completion_tokens": 242, which is great, but why do I have different numbers on my local machine?
Okay, I'm going to deploy this code to see what's going on on the server, otherwise I won't be able to know
Damn. When it throws the exception, it actually doesn't add any information, only this, which is pretty useless:
Sep 6 18:59:52 ip-172-31-5-146 web: body='{\n "error": {\n "message": "This model\'s maximum context length is 4097 tokens. However, your messages resulted in 4246 tokens. Please reduce the length of the messages.",\n "type": "invalid_request_error",\n "param": "messages",\n "code": "context_length_exceeded"\n }\n}\n' headers='{\'Date\': \'Wed, 06 Sep 2023 18:57:04 GMT\', \'Content-Type\': \'application/json\', \'Content-Length\': \'281\', \'Connection\': \'keep-alive\', \'access-control-allow-origin\': \'*\', \'openai-organization\': \'user-\gdz\', \'openai-processing-ms\': \'50\', \'openai-version\': \'2020-10-01\', \'strict-transport-security\': \'max-age=150; includeSubDomains\', \'x-ratelimit-limit-requests\': \'3500\', \'x-ratelimit-limit-tokens\': \'90000\', \'x-ratelimit-remaining-requests\': \'3499\', \'x-ratelimit-remaining-tokens\': \'85423\', \'x-ratelimit-reset-requests\': \'17ms\', \'x-ratelimit-reset-tokens\': \'3.05s\', \'x-request-id\': \'f663bc\', \'CF-Cache-Status\': \'DYNAMIC\', \'Server\': \', \'CF-RAY\': \'80YYZ\', \'alt-svc\': \'h3=":443"; ma=86400\'}' message='API response body'
And it's super inconsistent. Sometimes it throws the exception, sometimes (with the same query) it doesn't
When I include the chat history, is it part of the completion?
You can change the length of the memory, by default it cuts off at 1500 tokens
Ohhh, I see now what's going on. The previous responses were very long too. By default, I add the 5 latest user-system pairs plus one specific one. But can you please explain how I can control how many messages to include in the history, other than manually counting tokens?
It's probably better to use a token limit like Logan mentioned
thanks! But I have no clue how to use this token limit
should I do it with PromptHelper or some other objects/functions?
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
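And then pass it into the chat engine, something like this (a sketch; chat_mode="context" is just an example, keep whatever mode you already use):

from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)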
It's better to keep it dynamic like that; appending a certain fixed number of messages doesn't work as well. With a set fixed amount you tend to run into issues
oh, thanks, I see now. But how can I keep it dynamic? token_limit=1500 doesn't look very dynamic
I just meant it's more dynamic than a fixed amount of messages. Lets say all 5 messages are usually 500 tokens. What if the LLM gives a very short response? Now you're essentially losing out on context compared to a token limit
Yeah, I see. Now I'm struggling to find the description of this parameter and what it should be. The documentation has everything, but the search just sucks. I was never able to find anything there
What is 1500? Maybe it should be 500? Not sure how to use it
I think the default value is 3000 tokens, 1500 would be half of that
Just a way to control token usage: the lower you set it, the less you'll use for the memory functionality. It depends on how much of the conversation you want to track
Also the lower you put the value, the more space you'll have for retrieving context. That means it could help you solve your current issue
Yeah, I think I have to allocate specific numbers for history, context, and response. The last part can't be predicted, or can it? I found that max_tokens can limit it, but I'm not sure if it can be used in the chat engine
Correct. You can configure each one. For the max tokens you can pass it like this:
llm=OpenAI(model="gpt-3.5-turbo-16k", temperature=0, max_tokens=1000)
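Putting the pieces together, a rough sketch of budgeting the three parts (the numbers are only examples, and the index setup shown in the comment is an assumption about your vector store):

from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.memory import ChatMemoryBuffer

# ~1000 tokens reserved for the response, ~1500 for the chat history,
# the rest of the 4097-token window is left for the retrieved context
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=1000)
service_context = ServiceContext.from_defaults(llm=llm, chunk_size=1024)
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

# build or load your index with this service_context so the LLM settings apply, e.g.
# index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
chat_engine = index.as_chat_engine(chat_mode="context", memory=memory)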
Cool, thank you! Eventually, I can limit all the parts dynamically
Hopefully you can get the issue resolved now, if you have more questions just ask
Yeah, absolutely! Thank you guys @Teemu and @Logan M for your help!! (I'm implementing the logic now)
No worries! Happy to help! ❤️