Token usage calculation with Google ADK and Gemini-2.5-flash-native-audio-dialog

I built an application using Google ADK with the model gemini-2.5-flash-native-audio-preview-12-2025, and I’m having trouble calculating token usage.

My usage pattern is: the user provides audio input to the agent, and the agent generates an audio reply based on a text instruction. (I’ve configured RunConfig(response_modalities=["AUDIO"], input_audio_transcription=types.AudioTranscriptionConfig(), output_audio_transcription=types.AudioTranscriptionConfig()).)

I would like to get the number of text/audio tokens for each model input and output so I can calculate costs. However, when I print event.usage_metadata, I only get the following:

cache_tokens_details=None cached_content_token_count=None
candidates_token_count=None candidates_tokens_details=None
prompt_token_count=1549
prompt_tokens_details=[
  ModalityTokenCount(modality=TEXT, token_count=1526),
  ModalityTokenCount(modality=AUDIO, token_count=23)
]
thoughts_token_count=104
tool_use_prompt_token_count=None
tool_use_prompt_tokens_details=None
total_token_count=1615
traffic_type=None

I can’t tell which tokens belong to the input audio and which belong to the model’s audio output.
On top of that, the sum of prompt_token_count and thoughts_token_count (1549 + 104 = 1653) is greater than total_token_count (1615).
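
To make the mismatch concrete, here is a minimal arithmetic check using only the numbers from the dump above. It does not call any API; it only shows that the per-modality details are consistent with prompt_token_count while the total is not consistent with prompt plus thoughts.

```python
# Values copied from the usage_metadata dump above.
prompt_token_count = 1549
thoughts_token_count = 104
total_token_count = 1615
prompt_details = {"TEXT": 1526, "AUDIO": 23}

# The per-modality details add up to the prompt count.
assert sum(prompt_details.values()) == prompt_token_count

# Prompt + thoughts overshoots the reported total by this many tokens.
overshoot = prompt_token_count + thoughts_token_count - total_token_count
print(overshoot)  # 38

# Tokens in the reported total that are not prompt tokens.
non_prompt = total_token_count - prompt_token_count
print(non_prompt)  # 66
```

What the 66 non-prompt tokens represent is not stated anywhere in the metadata, so this check only narrows down the question.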

So I’m not sure whether usage_metadata is inaccurate, or if I’m looking in the wrong place to get the information I need.

I’d also like to know whether enabling input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig() results in any additional charges.

Hi @fengdog, welcome to the community!

Please refer here for a guide on token calculation: How token works

And no, enabling input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig() does not incur any additional charges.

Thank you!

Hi @fengdog,
prompt_token_count is the number of input tokens, and prompt_tokens_details breaks that number down by modality. I think modality=TEXT covers the text tokens of your instructions (the system prompt), and modality=AUDIO is the user's audio input.
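
As a rough sketch of how that breakdown could feed a cost calculation: the ModalityCount class below is a hypothetical stand-in with the same two fields as ModalityTokenCount, and the prices are placeholders, not real rates. With the real SDK objects the modality may be an enum rather than a plain string, so the lookup key might need adjusting.

```python
from dataclasses import dataclass


@dataclass
class ModalityCount:
    """Hypothetical stand-in mirroring ModalityTokenCount's two fields."""
    modality: str
    token_count: int


# Placeholder prices in USD per 1M input tokens; substitute the real rates.
PRICE_PER_MILLION = {"TEXT": 0.50, "AUDIO": 3.00}


def input_cost(details):
    """Sum per-modality input cost from a prompt_tokens_details-like list."""
    total = 0.0
    for item in details or []:  # details may be None
        total += item.token_count * PRICE_PER_MILLION[item.modality] / 1_000_000
    return total


# Token counts taken from the dump in the first post.
details = [ModalityCount("TEXT", 1526), ModalityCount("AUDIO", 23)]
print(f"{input_cost(details):.6f}")
```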

It seems to me you are missing the output tokens. For the Native Audio model, the field name is response_token_count. For token usage with the Native Audio model, refer to the Live API documentation.
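
One defensive way to read the output count is to try both field names mentioned in this thread and take whichever is populated. This is only a sketch: the SimpleNamespace objects are hypothetical stand-ins for usage metadata, the value 120 is made up, and whether ADK's event.usage_metadata exposes response_token_count at all is not confirmed here.

```python
from types import SimpleNamespace


def output_token_count(usage):
    """Return the first non-None output token count found, else None."""
    for name in ("response_token_count", "candidates_token_count"):
        value = getattr(usage, name, None)  # tolerate a missing attribute
        if value is not None:
            return value
    return None


# Hypothetical metadata shaped like a Live API usage report.
live_style = SimpleNamespace(response_token_count=120, candidates_token_count=None)
# Hypothetical metadata shaped like the dump in this thread (no output count).
adk_style = SimpleNamespace(candidates_token_count=None)

print(output_token_count(live_style))  # 120
print(output_token_count(adk_style))   # None
```

A None result means neither field was populated, which matches the dump in the first post.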

Anyway, I think there are problems with token counting for the Native Audio model. I reported this issue two months ago: Gemini Live API Reports Triple Prompt Token Consumption

But I haven’t received any response. I hope Google can provide an answer as soon as possible.

I hope I was helpful.

Ciao

Thank you, Srikanta_K_N. Based on the documentation you provided, I now understand that the values in prompt_tokens_details include the token counts for my system prompt and the user’s audio input, so they can be used to correctly calculate the input cost. But what’s strange is that candidates_token_count is None — I hope that doesn’t mean I have to calculate the audio length and convert it into tokens myself, haha!

Also, thanks — now I can confidently use input_audio_transcription and output_audio_transcription!

Thank you.

Thank you for your reply, @Gianluca_Emaldi — your response was very helpful. I’m now able to confirm that the input token calculation is working correctly. And as you mentioned, modality=TEXT corresponds to my instruction, while modality=AUDIO is the user’s input audio.

The remaining issue is that I’m unable to obtain the output token count. In my example, I printed the full event, but I didn’t see the responseTokenCount field that’s described in the Live API documentation. I’m wondering whether the problem might be related to the ADK agent framework.

Also, regarding the token calculation issue you reported with the Live API — I had never really noticed it before, because in my use case the input token values were usually close to what I expected. But from your findings, it does seem like quite a serious problem. Hopefully Google can fix it soon!

Hi, regarding the Live API: besides usage_metadata, how can I see how many output tokens you bill for in each session? Right now I have to calculate it myself at 32 tokens per second of audio, and I cannot verify the usage accurately.
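
For reference, the manual estimate described above can be sketched as follows. It assumes the output audio arrives as raw 16-bit mono PCM at 24 kHz and uses the 32 tokens per second figure from this post; both are assumptions to check against your own stream, and the result is an estimate rather than a billed figure.

```python
def estimate_audio_tokens(num_bytes, sample_rate=24_000, sample_width=2,
                          channels=1, tokens_per_second=32):
    """Estimate duration and token count from the size of raw PCM audio.

    Defaults assume 24 kHz, 16-bit (2 bytes per sample), mono output.
    """
    seconds = num_bytes / (sample_rate * sample_width * channels)
    return seconds, round(seconds * tokens_per_second)


# 480,000 bytes at 48,000 bytes per second is 10 seconds of audio.
seconds, tokens = estimate_audio_tokens(480_000)
print(seconds, tokens)  # 10.0 320
```

Summing num_bytes over every audio chunk received in a session gives the total to pass in.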