try attaching the callback manager directly to the llm
ok so if I set this on the llm, do I need to delete Settings.callback_manager = CallbackManager([token_counter]), or will it just have no impact?
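For context, this is roughly the setup in question (a sketch; import paths assume llama-index v0.10+ with the llama-index-llms-bedrock package installed):

    from llama_index.core import Settings
    from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
    from llama_index.llms.bedrock import Bedrock

    token_counter = TokenCountingHandler()
    callback_manager = CallbackManager([token_counter])

    # global default, used by components that don't get their own manager
    Settings.callback_manager = callback_manager

    # attaching directly to the LLM, as suggested above
    llm = Bedrock(
        model="anthropic.claude-3-sonnet-20240229-v1:0",
        callback_manager=callback_manager,
    )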
I'm currently working with Bedrock using the anthropic.claude-3-sonnet-20240229-v1:0 model, and I'm seeing unexpected output truncation. Despite setting num_output to 9216, the responses I receive are consistently cut off at around 2000 characters. Below is the configuration I'm using:
llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_window=180000,
    num_output=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
The token usage details are as follows:
LLM Prompt Token Usage: 3382
LLM Completion Token Usage: 392
LLM Prompt Token Usage: 3023
LLM Completion Token Usage: 394
**
Trace: chat
|_llm -> 16.008703 seconds
|_llm -> 16.005578 seconds
**
Why is the output getting truncated? Is there a limitation with the num_output parameter, or something else I might be missing? Thanks
num_output only reserves room for the response tokens in the context window; it doesn't set the actual generation limit on the request.
You'll want to set max_tokens on the LLM instead (and num_output will get set automatically under the hood)
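i.e. roughly this (a sketch, other kwargs omitted):

    llm = Bedrock(
        model="anthropic.claude-3-sonnet-20240229-v1:0",
        max_tokens=9216,  # generation limit actually sent to Bedrock
    )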
you mean remove num_output and set it like this?

llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_window=180000,
    max_tokens=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
Yea, like that (for whatever reason I think Bedrock has the var as context_size btw, instead of context_window)
I updated it as below:

llm = Bedrock(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    temperature=0.1,
    context_size=180000,
    max_tokens=9216,
    region_name=region,
    callback_manager=callback_manager,
    additional_kwargs=kwargs,
)
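(As a quick sanity check, printing the LLM's metadata should show whether the new names were picked up — a sketch, assuming the standard LLMMetadata fields exposed by llama-index:)

    # context_window / num_output here are derived from context_size / max_tokens
    print(llm.metadata.context_window, llm.metadata.num_output)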
But with the context window set to 180000 tokens, I noticed that for a single large input the LLM appears to make two separate calls, each with significant token usage, instead of one consolidated query. Here are the details:
LLM Prompt Token Usage: 3382
LLM Completion Token Usage: 1259
LLM Prompt Token Usage: 3023
LLM Completion Token Usage: 1261
**
Trace: chat
|_llm -> 30.167279 seconds
|_llm -> 30.164204 seconds
**
I would expect a single query given the high context_window setting.
Could there be a misunderstanding on my part about how context_window influences token usage or query segmentation, or am I doing something wrong?
What did you run to generate that?
(Like, what did the code look like)
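One way to see exactly what went into those two calls is to dump the per-call events from the token counter — a rough sketch, assuming the TokenCountingHandler from the token-counting docs:

    # each LLM call shows up as one event on the handler
    for event in token_counter.llm_token_counts:
        print("prompt tokens:", event.prompt_token_count)
        print("completion tokens:", event.completion_token_count)
        print("prompt preview:", event.prompt[:200])
        print("---")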
I refactored the code; btw, the answer looks ok.
It's not clear to me whether I should take the user content, use SummaryIndex to compose it into a summary_index, and send that, instead of sending the text directly?
@Logan M is it ok for SimpleChatEngine to send the user query directly to the LLM, or do I need to configure an index with the question, in case I have a bigger context to send along with the question?
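To make the question concrete, these are the two options I'm comparing (a rough sketch; SimpleChatEngine and SummaryIndex usage is taken from the docs, and user_query / user_content are just illustrative names):

    from llama_index.core import Document, SummaryIndex
    from llama_index.core.chat_engine import SimpleChatEngine

    # Option A: send the user text straight to the LLM
    chat_engine = SimpleChatEngine.from_defaults(llm=llm)
    response = chat_engine.chat(user_query)

    # Option B: wrap the larger user content in a SummaryIndex first
    index = SummaryIndex.from_documents([Document(text=user_content)])
    query_engine = index.as_query_engine(llm=llm)
    response = query_engine.query(user_query)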