I built an application using Google ADK with the model gemini-2.5-flash-native-audio-preview-12-2025, and I’m having trouble calculating token usage.
My usage pattern is: the user provides audio input to the agent, and the agent generates an audio reply based on a text instruction. (I’ve configured RunConfig(response_modalities=["AUDIO"], input_audio_transcription=types.AudioTranscriptionConfig(), output_audio_transcription=types.AudioTranscriptionConfig()).)
I would like to get the number of text/audio tokens for each model input and output so I can calculate costs. However, when I print event.usage_metadata, I only get the following:
I can’t tell which tokens belong to the input audio and which belong to the model’s audio output.
On top of that, the sum of prompt_token_count and thoughts_token_count is greater than total_token_count.
So I’m not sure whether usage_metadata is inaccurate, or if I’m looking in the wrong place to get the information I need.
I’d also like to know whether enabling input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig() results in any additional charges.
Please refer here for a guide on token calculation: How token works
And no, there are no additional charges if you enable input_audio_transcription=types.AudioTranscriptionConfig() and output_audio_transcription=types.AudioTranscriptionConfig().
Hi @fengdog, prompt_token_count is the number of input tokens and prompt_tokens_details is the per-modality breakdown. I think modality=TEXT covers the text tokens of your instructions (the system prompt), and modality=AUDIO is the user’s audio input.
It seems to me you are missing the output tokens. For the Native Audio model the field name is response_token_count. For token usage with the Native Audio model, refer to these docs:
Thank you, Srikanta_K_N. Based on the documentation you provided, I now understand that the values in prompt_tokens_details include the token counts for my system prompt and the user’s audio input, so they can be used to correctly calculate the input cost. But what’s strange is that candidates_token_count is None — I hope that doesn’t mean I have to calculate the audio length and convert it into tokens myself, haha!
Also, thanks — now I can confidently use input_audio_transcription and output_audio_transcription!
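In case it helps anyone reading along, here is a rough sketch of how the fields discussed above could be read from a usage object: input tokens grouped by modality from prompt_tokens_details, and an output count taken from response_token_count with a fallback to candidates_token_count. The helper name summarize_usage and the SimpleNamespace stand-in with made-up numbers are my own assumptions for illustration; in a real app you would pass event.usage_metadata instead, and either output field may still come back as None.

```python
from types import SimpleNamespace


def summarize_usage(usage):
    """Group input tokens by modality and pick whichever output count is set.

    `usage` is any object shaped like event.usage_metadata. Missing
    attributes are treated as None rather than raising.
    """
    by_modality = {}
    for detail in getattr(usage, "prompt_tokens_details", None) or []:
        # The modality may be an enum (with .name) or a plain string.
        modality = getattr(detail.modality, "name", None) or str(detail.modality)
        by_modality[modality] = by_modality.get(modality, 0) + (detail.token_count or 0)

    # Field name mentioned for the Native Audio model, then the other name
    # seen in this thread as a fallback.
    output_tokens = getattr(usage, "response_token_count", None)
    if output_tokens is None:
        output_tokens = getattr(usage, "candidates_token_count", None)

    return {
        "input_total": getattr(usage, "prompt_token_count", None),
        "input_by_modality": by_modality,
        "output_tokens": output_tokens,
    }


# Hypothetical stand-in for event.usage_metadata; the numbers are invented.
example_usage = SimpleNamespace(
    prompt_token_count=150,
    prompt_tokens_details=[
        SimpleNamespace(modality="TEXT", token_count=100),
        SimpleNamespace(modality="AUDIO", token_count=50),
    ],
    candidates_token_count=None,
)

summary = summarize_usage(example_usage)
print(summary)
```

If output_tokens is None here, as in my case, the output side still has to come from somewhere else.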
Thank you for your reply, @Gianluca_Emaldi — your response was very helpful. I’m now able to confirm that the input token calculation is working correctly. And as you mentioned, modality=TEXT corresponds to my instruction, while modality=AUDIO is the user’s input audio.
The remaining issue is that I’m unable to obtain the output token count. In my example, I printed the full event, but I didn’t see the responseTokenCount field that’s described in the Live API documentation. I’m wondering whether the problem might be related to the ADK agent framework.
Also, regarding the token calculation issue you reported with the Live API — I had never really noticed it before, because in my use case the input token values were usually close to what I expected. But from your findings, it does seem like quite a serious problem. Hopefully Google can fix it soon!
Hi, regarding the Live API: besides usage_metadata, how can I see how many output tokens you bill for in each session? Right now I have to calculate it myself at 32 tokens/s for audio, and I cannot verify the usage accurately.
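For reference, this is roughly the manual estimate I mean, as a minimal sketch. The 32 tokens/s rate is the figure I quoted above. The PCM defaults (24 kHz, 16-bit, mono) and rounding up to a whole token are my assumptions, and the function names are just illustrative, so treat the result as an approximation rather than what is actually billed.

```python
import math

# Rate quoted in this thread; verify against the current pricing docs.
AUDIO_TOKENS_PER_SECOND = 32


def pcm_duration_seconds(num_bytes, sample_rate_hz=24000, bytes_per_sample=2, channels=1):
    """Duration of raw PCM audio. The default format is an assumption."""
    return num_bytes / (sample_rate_hz * bytes_per_sample * channels)


def estimate_audio_tokens(duration_seconds, tokens_per_second=AUDIO_TOKENS_PER_SECOND):
    """Estimate audio tokens from duration, rounding up (an assumption)."""
    return math.ceil(duration_seconds * tokens_per_second)


# Example: total bytes of output audio collected over a session (made-up value).
session_audio_bytes = 120000
seconds = pcm_duration_seconds(session_audio_bytes)
print(seconds, estimate_audio_tokens(seconds))
```

Summing the byte counts of every audio chunk received in a session and passing the total to pcm_duration_seconds gives a per-session estimate, but there is nothing here to cross-check it against.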