Token count is missing from LLM spans

Bug report:
There is no way to get token usage from OpenAI when using stream_chat.
Here's a full discussion of the issue https://community.openai.com/t/usage-stats-now-available-when-using-streaming-with-the-chat-completions-api-or-completions-api/738156/17.
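For context, a minimal way to reproduce what's described here with LlamaIndex's OpenAI wrapper (a sketch; the model and prompt are placeholders):

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

# Consume the stream, keeping the final ChatResponse.
resp = None
for resp in llm.stream_chat([ChatMessage(role="user", content="say hello world")]):
    pass

# With default settings the raw payload of the last chunk has no usage entry,
# so no token counts are available for the LLM span.
print(resp.raw)
```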
Does Groq report token counts the same way as OpenAI? If not, Arize might be missing it.
I just confirmed that the issue is also present when using the standard OpenAI LLM. Groq and OpenAI report token count in exactly the same way, and I was able to monkey-patch logging into place to verify that the token_count is not missing in the response from either provider. Yet it is still not showing up in Phoenix. How can I inspect the spans to see if they are getting the prompt_token information added to them correctly?

OpenAI's response.usage structure:

```json
"usage": {
  "prompt_tokens": 13,
  "completion_tokens": 7,
  "total_tokens": 20
}
```


Groq's response.usage structure:

```json
"usage": {
  "prompt_tokens": 24,
  "completion_tokens": 377,
  "total_tokens": 401,
  "prompt_time": 0.009,
  "completion_time": 0.774,
  "total_time": 0.783
}
```
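For reference, both providers expose these fields the same way on a non-streaming response through the OpenAI-compatible Python client; a minimal sketch (the model and prompt are placeholders, and the Groq case assumes its OpenAI-compatible endpoint):

```python
from openai import OpenAI

# For Groq, construct the client with Groq's OpenAI-compatible base_url and API key instead.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "say hello world"}],
)

# Non-streaming responses carry the usage totals directly.
print(response.usage.prompt_tokens, response.usage.completion_tokens, response.usage.total_tokens)
```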
Phoenix allows me to see the span attributes, which do not include anything about token counts. Here are the span attributes:

```json
{
  "llm": {
    "invocation_parameters": "...",
    "model_name": "gpt-3.5-turbo",
    "input_messages": [
      {
        "message": {
          "content": "...",
          "role": "system"
        }
      },
      {
        "message": {
          "content": "what is monte carlo tree search, explain it to me simply.",
          "role": "user"
        }
      },
      {
        "message": {
          "content": "...",
          "role": "assistant"
        }
      }
    ]
  },
  "output": {
    "value": "some text that i removed beause it was long"
  },
  "openinference": { "span": { "kind": "LLM" } }
}
```
Can you point me to where these spans are created so I can make sure that token count is getting added to them?
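As an aside, one way to check programmatically whether token counts ever reach the spans is to pull them out of Phoenix; a minimal sketch, assuming a locally running Phoenix instance and the OpenInference llm.token_count.* attribute convention:

```python
import phoenix as px

# Fetch the recorded spans as a dataframe from the running Phoenix instance.
spans = px.Client().get_spans_dataframe()

# OpenInference records usage under llm.token_count.* span attributes; if these
# columns are absent or empty, the counts never made it onto the spans.
token_cols = [c for c in spans.columns if "llm.token_count" in c]
print(token_cols)
print(spans[token_cols].dropna(how="all").head())
```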
It's all happening inside of Arize's code, so I can't say for sure.
Is there a way for me to observe the events being emitted from llama-index to confirm that they contain the correct token count attributes?
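A minimal sketch of doing that with LlamaIndex's instrumentation dispatcher (assuming llama-index >= 0.10; the handler class name is just illustrative):

```python
import llama_index.core.instrumentation as instrument
from llama_index.core.instrumentation.event_handlers import BaseEventHandler


class PrintEventHandler(BaseEventHandler):
    """Print every instrumentation event so its fields can be inspected."""

    @classmethod
    def class_name(cls) -> str:
        return "PrintEventHandler"

    def handle(self, event) -> None:
        print(f"Event type: {event.class_name()}")
        print(event.dict())


# Attach to the root dispatcher so events from all modules are observed.
instrument.get_dispatcher().add_event_handler(PrintEventHandler())
```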
Understood, I'm using custom event handlers to log what's going on with the LLMChatEndEvent, and it does not appear that token_count is included anywhere. If this is not a bug, how is it possible to access the token count? The output below is the result of printing the event:

```
Event type: LLMChatEndEvent
{'timestamp': datetime.datetime(2024, 7, 3, 9, 35, 54, 737570), 'id': UUID('15c5e867-3683-402a-ae9e-b710c4f4eed1'), 'span_id': 'BaseEmbedding.get_query_embedding-090a2495-3a9a-4f4f-900b-7ffee0bd8a2d', 'messages': [{'role': <MessageRole.SYSTEM: 'system'>, 'content': "blah blah some context", 'additional_kwargs': {}}, {'role': <MessageRole.USER: 'user'>, 'content': 'say hello world', 'additional_kwargs': {}}, {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': 'Hello world! How can I assist you today?', 'additional_kwargs': {}}, {'role': <MessageRole.USER: 'user'>, 'content': 'oi say hello world again', 'additional_kwargs': {}}], 'response': {'message': {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': "Hello world! It's great to interact with you. How can I help you further?", 'additional_kwargs': {}}, 'raw': {'id': 'chatcmpl-9gxFSkhIeU4QBE2c6ZpC5C50j52Mb', 'choices': [Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], 'created': 1720024554, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion.chunk', 'system_fingerprint': None}, 'delta': '', 'logprobs': None, 'additional_kwargs': {}}, 'class_name': 'LLMChatEndEvent'}
```
Oh weird, it should be under raw
unless OpenAI changed how their response object works
because raw is the actual response object from OpenAI
If you guys are using the chat completion chunk object: https://platform.openai.com/docs/api-reference/chat/streaming

Then according to the OpenAI docs:

```
An optional field that will only be present when you set stream_options: {"include_usage": true} in your request. When present, it contains a null value except for the last chunk which contains the token usage statistics for the entire request.
```
How can I set the include_usage flag?
additional_kwargs={"include_usage": True}
WARNI [llama_index.core.chat_engine.types] Encountered exception writing response to history: Completions.create() got an unexpected keyword argument 'include_usage'
And some error is thrown internally which prevents the response from being streamed back
```
541abfb2-426f-48b6-bb90-a731f0032300
2024-07-03 10:28:36.582779
StreamingAgentChatResponse.write_response_to_history-43aae04c-e630-4d99-a66a-9b7e06910192
Event type: StreamChatErrorEvent
```
This seems like it might be a bug, and from looking at StreamingAgentChatResponse.write_response_to_history it's not clear to me what the patch is. How do you recommend I get around this for now?
```python
>>> llm = OpenAI(include_usage=True)
>>> llm.complete("Hello")
CompletionResponse(text='Hello! How can I assist you today?', additional_kwargs={}, raw={'id': 'chatcmpl-9gyRPsyOZN5YbNiTthBSOaRR5teVi', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], 'created': 1720029139, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=9, prompt_tokens=8, total_tokens=17)}, logprobs=None, delta=None)
```
That worked for me
Probably the LLM error is what's causing the error during streaming/writing to memory.
I believe the bug is with llm.stream_chat(); I was able to repeat that working behavior with llm.complete().
How does OpenAI want include_usage to be passed in? I am stepping through the generator and the usage is never sent back.
Should be under stream options
[Attachment: Screenshot_2024-07-03_at_11.06.20_AM.png]
Note that only a single chunk, right before the final data: [DONE] message, will include the token usage.
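A sketch of that behavior against the raw openai client (requires openai >= 1.26.0 for stream_options; the model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "say hello world"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # usage is None on every chunk except the final one, which arrives with an
    # empty choices list right before the stream closes.
    if chunk.usage is not None:
        print(chunk.usage)
```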
The following results in errors:

```python
self.llm = OpenAI(model="gpt-3.5-turbo", additional_kwargs={"stream_options": {"include_usage": True}})
```
Also confirmed that usage is correctly included in the response when using llm.chat(); just stream_chat() is broken.
Here's a link to a discussion of the whole issue; basically the openai package needs to be upgraded to

openai >= 1.26.0

https://community.openai.com/t/usage-stats-now-available-when-using-streaming-with-the-chat-completions-api-or-completions-api/738156
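Putting the pieces together, a sketch of the working setup once the openai package is upgraded (parameters are illustrative):

```python
# pip install -U "openai>=1.26.0" llama-index
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo",
    additional_kwargs={"stream_options": {"include_usage": True}},
)

resp = None
for resp in llm.stream_chat([ChatMessage(role="user", content="Hello World")]):
    pass

# With a new enough openai package, the final chunk's raw payload now carries the
# usage totals, which is what ends up on the LLMChatEndEvent and the LLM span.
print(resp.raw)
```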
Should I open an issue for this?
On Arize maybe? The usage is being logged in the final event when include_usage is properly set:

```python
{'timestamp': datetime.datetime(2024, 7, 3, 13, 21, 22, 467994), 'id': UUID('b85f61d1-291c-48c4-9a0d-5470c59fea49'), 'span_id': 'OpenAI.stream_chat-0d12f4a6-ef2d-4751-9ceb-a7b317b21208', 'messages': [{'role': <MessageRole.USER: 'user'>, 'content': 'Hello World', 'additional_kwargs': {}}], 'response': {'message': {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': 'Hello! How can I assist you today?', 'additional_kwargs': {}}, 'raw': {'id': 'chatcmpl-9gzpa8J3cdFXgaEtzfXqhZU5a73fE', 'choices': [], 'created': 1720034482, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion.chunk', 'service_tier': None, 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=9, prompt_tokens=9, total_tokens=18)}, 'delta': '', 'logprobs': None, 'additional_kwargs': {}}, 'class_name': 'LLMChatEndEvent'}
```
Okay, it was resolved for me by upgrading llama-index