Token usage calculation with Google ADK and Gemini-2.5-flash-native-audio-dialog

fengdog · January 2, 2026, 6:57am

I built an application using Google ADK with the model gemini-2.5-flash-native-audio-preview-12-2025, and I’m having trouble calculating token usage.

My usage pattern is: the user provides an audio input to the agent, and the agent generates an audio reply based on a text instruction. (I’ve configured RunConfig(response_modalities=["AUDIO"], input_audio_transcription=types.AudioTranscriptionConfig(), output_audio_transcription=types.AudioTranscriptionConfig()).)

I would like to get the number of text/audio tokens for each model input and output so I can calculate costs. However, when I print event.usage_metadata, I only get the following:

cache_tokens_details=None cached_content_token_count=None
candidates_token_count=None candidates_tokens_details=None
prompt_token_count=1549
prompt_tokens_details=[
  ModalityTokenCount(modality=TEXT, token_count=1526),
  ModalityTokenCount(modality=AUDIO, token_count=23)
]
thoughts_token_count=104
tool_use_prompt_token_count=None
tool_use_prompt_tokens_details=None
total_token_count=1615
traffic_type=None

I can’t tell which tokens belong to the input audio and which belong to the model’s audio output.
On top of that, the sum of prompt_token_count and thoughts_token_count is greater than total_token_count.

So I’m not sure whether usage_metadata is inaccurate, or if I’m looking in the wrong place to get the information I need.
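The figures in the printout above can be checked with a few lines of arithmetic. This is only a sketch that hard-codes the printed values; it calls no API and does not explain where the difference comes from.

```python
# Values copied from the usage_metadata printout above.
prompt_token_count = 1549
prompt_tokens_details = {"TEXT": 1526, "AUDIO": 23}
thoughts_token_count = 104
total_token_count = 1615

# The per-modality prompt details add up to the prompt total.
details_sum = sum(prompt_tokens_details.values())

# Prompt plus thoughts overshoots the reported total.
prompt_plus_thoughts = prompt_token_count + thoughts_token_count
overshoot = prompt_plus_thoughts - total_token_count

# The reported total exceeds the prompt count by less than the thoughts count.
total_minus_prompt = total_token_count - prompt_token_count

print(details_sum, prompt_plus_thoughts, overshoot, total_minus_prompt)
# prints: 1549 1653 38 66
```

So the prompt details are internally consistent, while the total is 38 tokens short of prompt plus thoughts.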

I’d also like to know whether enabling input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig() results in any additional charges?

Srikanta_K_N · January 2, 2026, 9:15am

Hi @fengdog, welcome to the community!

Please refer here for a guide on token calculation: How token works

And no, enabling input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig() does not incur any additional charges.

Thank you!

Gianluca_Emaldi · January 3, 2026, 10:43am

Hi @fengdog,
prompt_token_count is the number of input tokens, and prompt_tokens_details breaks that number down by modality. I think modality=TEXT covers the text tokens of your instructions or system prompt, and modality=AUDIO is the user’s audio input.

It seems to me you are missing the output tokens. For the Native Audio model, the field name is response_token_count. For token usage of the Native Audio model, refer to these docs:

and

Anyway, I think there are problems with the token count for the Native Audio model. It’s been 2 months since I reported this issue: Gemini Live API Reports Triple Prompt Token Consumption

But I haven’t received any response. I hope Google can provide an answer as soon as possible.

I hope I was helpful.

Ciao

fengdog · January 6, 2026, 2:20am

Thank you, Srikanta_K_N. Based on the documentation you provided, I now understand that the values in prompt_tokens_details include the token counts for my system prompt and the user’s audio input, so they can be used to correctly calculate the input cost. But what’s strange is that candidates_token_count is None — I hope that doesn’t mean I have to calculate the audio length and convert it into tokens myself, haha!

Also, thanks — now I can confidently use input_audio_transcription and output_audio_transcription!

Thank you.

fengdog · January 6, 2026, 2:40am

Thank you for your reply, @Gianluca_Emaldi — your response was very helpful. I’m now able to confirm that the input token calculation is working correctly. And as you mentioned, modality=TEXT corresponds to my instruction, while modality=AUDIO is the user’s input audio.

The remaining issue is that I’m unable to obtain the output token count. In my example, I printed the full event, but I didn’t see the responseTokenCount field that’s described in the Live API documentation. I’m wondering whether the problem might be related to the ADK agent framework.
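One defensive way to look for the output count is to check both field names mentioned in this thread. The sketch below assumes only that the usage object exposes its fields as attributes; the helper name `output_token_count` is hypothetical, and `SimpleNamespace` stands in for a real usage_metadata object. Whether ADK populates either field for this model is exactly the open question here, so a result of None is a possible outcome.

```python
from types import SimpleNamespace
from typing import Optional


def output_token_count(usage) -> Optional[int]:
    """Return the first non-None output token count found, else None.

    Hypothetical helper: it tries the Live API style name first, then the
    candidates_token_count name seen in the printout from the first post.
    """
    for name in ("response_token_count", "candidates_token_count"):
        value = getattr(usage, name, None)
        if value is not None:
            return value
    return None


# Stand-in mirroring the printout in the first post (no output count present).
adk_like = SimpleNamespace(candidates_token_count=None, prompt_token_count=1549)

# Stand-in with a made-up value, purely for illustration.
live_like = SimpleNamespace(response_token_count=42)

print(output_token_count(adk_like))   # None
print(output_token_count(live_like))  # 42
```

Using getattr with a default keeps the helper from raising when a field is absent on the object.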

Also, regarding the token calculation issue you reported with the Live API — I had never really noticed it before, because in my use case the input token values were usually close to what I expected. But from your findings, it does seem like quite a serious problem. Hopefully Google can fix it soon!

Joe_Hu · January 9, 2026, 3:59am

Hi, regarding the Live API: besides usage_metadata, how can I see how many output tokens I am billed for in each session? Right now I have to calculate it myself at 32 tokens/s for audio, and I cannot verify the usage accurately.
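The manual estimate described above can be sketched as follows. The rate of 32 tokens per second is the figure quoted in this post; the function name and the choice to round up are assumptions for illustration, not documented billing behavior.

```python
import math


def estimate_audio_tokens(seconds: float, tokens_per_second: int = 32) -> int:
    """Estimate audio tokens from duration.

    Assumption: partial tokens are rounded up. The actual rounding used
    for billing is not stated in this thread.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds * tokens_per_second)


print(estimate_audio_tokens(12.5))  # 400
```

An estimate like this only approximates usage; it is not a substitute for a reported output token count.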
