Token count is missing from LLM spans

Bug report:
There is no way to get token usage from OpenAI when using stream_chat.
Here's a full discussion of the issue https://community.openai.com/t/usage-stats-now-available-when-using-streaming-with-the-chat-completions-api-or-completions-api/738156/17.
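For context, a minimal way to reproduce what's described here with LlamaIndex's OpenAI wrapper (a sketch; the model and prompt are placeholders):

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

# Consume the stream, keeping the final ChatResponse.
resp = None
for resp in llm.stream_chat([ChatMessage(role="user", content="say hello world")]):
    pass

# With default settings the raw payload of the last chunk has no usage entry,
# so no token counts are available for the LLM span.
print(resp.raw)
```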
Does Groq report token counts the same way as OpenAI? If not, Arize might be missing it.
I just confirmed that the issue is also present when using the standard OpenAI LLM. Groq and OpenAI report token count in exactly the same way, and I was able to monkey-patch logging into place to verify that the token_count is not missing in the response from either provider. Yet it is still not showing up in Phoenix. How can I inspect the spans to see if they are getting the prompt_token information added to them correctly?

OpenAI's response.usage structure:

```json
"usage": {
  "prompt_tokens": 13,
  "completion_tokens": 7,
  "total_tokens": 20
}
```


Groq's response.usage structure:

```json
"usage": {
  "prompt_tokens": 24,
  "completion_tokens": 377,
  "total_tokens": 401,
  "prompt_time": 0.009,
  "completion_time": 0.774,
  "total_time": 0.783
}
```
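For reference, both providers expose these fields the same way on a non-streaming response through the OpenAI-compatible Python client; a minimal sketch (the model and prompt are placeholders, and the Groq case assumes its OpenAI-compatible endpoint):

```python
from openai import OpenAI

# For Groq, construct the client with Groq's OpenAI-compatible base_url and API key instead.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "say hello world"}],
)

# Non-streaming responses carry the usage totals directly.
print(response.usage.prompt_tokens, response.usage.completion_tokens, response.usage.total_tokens)
```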
Phoenix allows me to see the span attributes, which do not include anything about token counts. Here are the span attributes:

```json
{
  "llm": {
    "invocation_parameters": "...",
    "model_name": "gpt-3.5-turbo",
    "input_messages": [
      {
        "message": {
          "content": "...",
          "role": "system"
        }
      },
      {
        "message": {
          "content": "what is monte carlo tree search, explain it to me simply.",
          "role": "user"
        }
      },
      {
        "message": {
          "content": "...",
          "role": "assistant"
        }
      }
    ]
  },
  "output": {
    "value": "some text that i removed beause it was long"
  },
  "openinference": { "span": { "kind": "LLM" } }
}
```
Can you point me to where these spans are created so I can make sure that token count is getting added to them?
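As an aside, one way to check programmatically whether token counts ever reach the spans is to pull them out of Phoenix; a minimal sketch, assuming a locally running Phoenix instance and the OpenInference llm.token_count.* attribute convention:

```python
import phoenix as px

# Fetch the recorded spans as a dataframe from the running Phoenix instance.
spans = px.Client().get_spans_dataframe()

# OpenInference records usage under llm.token_count.* span attributes; if these
# columns are absent or empty, the counts never made it onto the spans.
token_cols = [c for c in spans.columns if "llm.token_count" in c]
print(token_cols)
print(spans[token_cols].dropna(how="all").head())
```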
It's all happening inside of Arize's code, so I can't say for sure.
Is there a way for me to observe the events being emitted from llama-index to confirm that they contain the correct token count attributes?
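A minimal sketch of doing that with LlamaIndex's instrumentation dispatcher (assuming llama-index >= 0.10; the handler class name is just illustrative):

```python
import llama_index.core.instrumentation as instrument
from llama_index.core.instrumentation.event_handlers import BaseEventHandler


class PrintEventHandler(BaseEventHandler):
    """Print every instrumentation event so its fields can be inspected."""

    @classmethod
    def class_name(cls) -> str:
        return "PrintEventHandler"

    def handle(self, event) -> None:
        print(f"Event type: {event.class_name()}")
        print(event.dict())


# Attach to the root dispatcher so events from all modules are observed.
instrument.get_dispatcher().add_event_handler(PrintEventHandler())
```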
Understood, I'm using custom event handlers to log what's going on with the LLMChatEndEvent, and it does not appear that token_count is included anywhere. If this is not a bug, how is it possible to access the token count? The output below is the result of printing the event:

```
Event type: LLMChatEndEvent
{'timestamp': datetime.datetime(2024, 7, 3, 9, 35, 54, 737570), 'id': UUID('15c5e867-3683-402a-ae9e-b710c4f4eed1'), 'span_id': 'BaseEmbedding.get_query_embedding-090a2495-3a9a-4f4f-900b-7ffee0bd8a2d', 'messages': [{'role': <MessageRole.SYSTEM: 'system'>, 'content': "blah blah some context", 'additional_kwargs': {}}, {'role': <MessageRole.USER: 'user'>, 'content': 'say hello world', 'additional_kwargs': {}}, {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': 'Hello world! How can I assist you today?', 'additional_kwargs': {}}, {'role': <MessageRole.USER: 'user'>, 'content': 'oi say hello world again', 'additional_kwargs': {}}], 'response': {'message': {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': "Hello world! It's great to interact with you. How can I help you further?", 'additional_kwargs': {}}, 'raw': {'id': 'chatcmpl-9gxFSkhIeU4QBE2c6ZpC5C50j52Mb', 'choices': [Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], 'created': 1720024554, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion.chunk', 'system_fingerprint': None}, 'delta': '', 'logprobs': None, 'additional_kwargs': {}}, 'class_name': 'LLMChatEndEvent'}
```
Oh weird, it should be under raw
unless OpenAI changed how their response object works
because raw is the actual response object from OpenAI
If you guys are using the chat completion chunk object: https://platform.openai.com/docs/api-reference/chat/streaming

Then according to the OpenAI docs:

```
An optional field that will only be present when you set stream_options: {"include_usage": true} in your request. When present, it contains a null value except for the last chunk which contains the token usage statistics for the entire request.
```
How can I set the include_usage flag?
additional_kwargs={"include_usage": True}
WARNI [llama_index.core.chat_engine.types] Encountered exception writing response to history: Completions.create() got an unexpected keyword argument 'include_usage'
And some error is thrown internally which prevents the response from being streamed back
```
541abfb2-426f-48b6-bb90-a731f0032300
2024-07-03 10:28:36.582779
StreamingAgentChatResponse.write_response_to_history-43aae04c-e630-4d99-a66a-9b7e06910192
Event type: StreamChatErrorEvent
```
This seems like it might be a bug, and from looking at StreamingAgentChatResponse.write_response_to_history it's not clear to me what the patch is. How do you recommend I get around this for now?
```python
>>> llm = OpenAI(include_usage=True)
>>> llm.complete("Hello")
CompletionResponse(text='Hello! How can I assist you today?', additional_kwargs={}, raw={'id': 'chatcmpl-9gyRPsyOZN5YbNiTthBSOaRR5teVi', 'choices': [Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], 'created': 1720029139, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'service_tier': None, 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=9, prompt_tokens=8, total_tokens=17)}, logprobs=None, delta=None)
```
That worked for me
Probably the LLM error is what's causing the error during streaming/writing to memory.
I believe the bug is with llm.stream_chat(); I was able to repeat that working behavior with llm.complete().
How does OpenAI want include_usage to be passed in? I am stepping through the generator and the usage is never sent back.
Should be under stream options
[Attachment: Screenshot_2024-07-03_at_11.06.20_AM.png]
Note that only a single chunk, right before the final data: [DONE] message, will include the token usage.
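A sketch of that behavior against the raw openai client (requires openai >= 1.26.0 for stream_options; the model and prompt are placeholders):

```python
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "say hello world"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # usage is None on every chunk except the final one, which arrives with an
    # empty choices list right before the stream closes.
    if chunk.usage is not None:
        print(chunk.usage)
```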
The following results in errors:

```python
self.llm = OpenAI(model="gpt-3.5-turbo", additional_kwargs={"stream_options": {"include_usage": True}})
```
Also confirmed that usage is correctly included in the response when using llm.chat(); just stream_chat() is broken.
Here's a link to a discussion of the whole issue; basically the openai package needs to be upgraded to

openai >= 1.26.0

https://community.openai.com/t/usage-stats-now-available-when-using-streaming-with-the-chat-completions-api-or-completions-api/738156
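Putting the pieces together, a sketch of the working setup once the openai package is upgraded (parameters are illustrative):

```python
# pip install -U "openai>=1.26.0" llama-index
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo",
    additional_kwargs={"stream_options": {"include_usage": True}},
)

resp = None
for resp in llm.stream_chat([ChatMessage(role="user", content="Hello World")]):
    pass

# With a new enough openai package, the final chunk's raw payload now carries the
# usage totals, which is what ends up on the LLMChatEndEvent and the LLM span.
print(resp.raw)
```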
Should I open an issue for this?
On Arize maybe? The usage is being logged in the final event when include_usage is properly set:

```python
{'timestamp': datetime.datetime(2024, 7, 3, 13, 21, 22, 467994), 'id': UUID('b85f61d1-291c-48c4-9a0d-5470c59fea49'), 'span_id': 'OpenAI.stream_chat-0d12f4a6-ef2d-4751-9ceb-a7b317b21208', 'messages': [{'role': <MessageRole.USER: 'user'>, 'content': 'Hello World', 'additional_kwargs': {}}], 'response': {'message': {'role': <MessageRole.ASSISTANT: 'assistant'>, 'content': 'Hello! How can I assist you today?', 'additional_kwargs': {}}, 'raw': {'id': 'chatcmpl-9gzpa8J3cdFXgaEtzfXqhZU5a73fE', 'choices': [], 'created': 1720034482, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion.chunk', 'service_tier': None, 'system_fingerprint': None, 'usage': CompletionUsage(completion_tokens=9, prompt_tokens=9, total_tokens=18)}, 'delta': '', 'logprobs': None, 'additional_kwargs': {}}, 'class_name': 'LLMChatEndEvent'}
```
Okay, it was resolved for me by upgrading llama-index