try attaching the callback manager directly to the llm
ok so if I set this on the llm, do I need to delete Settings.callback_manager = CallbackManager([token_counter]), or will it just have no impact?
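For context, this is roughly the setup in question (a sketch; import paths assume llama-index v0.10+ with the llama-index-llms-bedrock package installed):

    from llama_index.core import Settings
    from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
    from llama_index.llms.bedrock import Bedrock

    token_counter = TokenCountingHandler()
    callback_manager = CallbackManager([token_counter])

    # global default, used by components that don't get their own manager
    Settings.callback_manager = callback_manager

    # attaching directly to the LLM, as suggested above
    llm = Bedrock(
        model="anthropic.claude-3-sonnet-20240229-v1:0",
        callback_manager=callback_manager,
    )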
I'm currently working with Bedrock using the anthropic.claude-3-sonnet-20240229-v1:0 model, and I'm seeing unexpected output truncation. Despite setting num_output to 9216, the responses I receive are consistently cut off at around 2000 characters. Below is the configuration I'm using:
llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_window=180000,
    num_output=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
The token usage details are as follows:
LLM Prompt Token Usage: 3382
LLM Completion Token Usage: 392
LLM Prompt Token Usage: 3023
LLM Completion Token Usage: 394
**
Trace: chat
|_llm -> 16.008703 seconds
|_llm -> 16.005578 seconds
**
Why is the output getting truncated? Is there a limitation with the num_output parameter, or something else I might be missing? Thanks
num_output only reserves room for the response tokens in the context window; it doesn't set the actual generation limit on the request.
You'll want to set max_tokens on the LLM instead (and num_output will get set automatically under the hood)
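i.e. roughly this (a sketch, other kwargs omitted):

    llm = Bedrock(
        model="anthropic.claude-3-sonnet-20240229-v1:0",
        max_tokens=9216,  # generation limit actually sent to Bedrock
    )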
you mean remove num_output and set it like this?

llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_window=180000,
    max_tokens=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
Yea, like that (for whatever reason I think Bedrock has the var as context_size btw, instead of context_window)
I updated it as below:

llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_size=180000,
    max_tokens=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
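(As a quick sanity check, printing the LLM's metadata should show whether the new names were picked up — a sketch, assuming the standard LLMMetadata fields exposed by llama-index:)

    # context_window / num_output here are derived from context_size / max_tokens
    print(llm.metadata.context_window, llm.metadata.num_output)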
But with the context window set to 180000 tokens, I noticed that for a single large input the LLM appears to make two separate calls, each with significant token usage, instead of one consolidated query. Here are the details:
LLM Prompt Token Usage: 3382
LLM Completion Token Usage: 1259
LLM Prompt Token Usage: 3023
LLM Completion Token Usage: 1261
**
Trace: chat
|_llm -> 30.167279 seconds
|_llm -> 30.164204 seconds
**
I would expect a single query given the high context_window setting.
Could there be a misunderstanding on my part about how context_window influences token usage or query segmentation, or am I doing something wrong?
What did you run to generate that?
(Like, what did the code look like)
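One way to see exactly what went into those two calls is to dump the per-call events from the token counter — a rough sketch, assuming the TokenCountingHandler from the token-counting docs:

    # each LLM call shows up as one event on the handler
    for event in token_counter.llm_token_counts:
        print("prompt tokens:", event.prompt_token_count)
        print("completion tokens:", event.completion_token_count)
        print("prompt preview:", event.prompt[:200])
        print("---")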
I refactored the code; btw, the answer looks ok.
It's not clear to me whether I should take the user content, use SummaryIndex to compose it into a summary_index, and send that, instead of sending the text directly?
@Logan M is it ok for SimpleChatEngine to send the user query directly to the LLM, or do I need to configure an index with the question, in case I have a bigger context to send along with the question?
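To make the question concrete, these are the two options I'm comparing (a rough sketch; SimpleChatEngine and SummaryIndex usage is taken from the docs, and user_query / user_content are just illustrative names):

    from llama_index.core import Document, SummaryIndex
    from llama_index.core.chat_engine import SimpleChatEngine

    # Option A: send the user text straight to the LLM
    chat_engine = SimpleChatEngine.from_defaults(llm=llm)
    response = chat_engine.chat(user_query)

    # Option B: wrap the larger user content in a SummaryIndex first
    index = SummaryIndex.from_documents([Document(text=user_content)])
    query_engine = index.as_query_engine(llm=llm)
    response = query_engine.query(user_query)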