Hi @fengdog,
prompt_token_count is the total number of input tokens, and prompt_tokens_details breaks that total down by modality: modality=TEXT should be the text tokens of your instructions (or the system prompt), and modality=AUDIO is the user's audio input.
It seems to me you are missing the output tokens. For the Native Audio model, the field name is response_token_count. For token usage of the Native Audio model, refer to these docs:
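To make the field layout concrete, here is a minimal sketch of how you might summarize such a usage report. The field names (prompt_token_count, response_token_count, prompt_tokens_details, modality, token_count) follow the shapes discussed above; the sample values and the helper function itself are made up for illustration, not part of any SDK.

```python
def summarize_usage(usage: dict) -> dict:
    """Report input/output totals and break input tokens down by modality."""
    by_modality = {}
    for detail in usage.get("prompt_tokens_details", []):
        by_modality[detail["modality"]] = detail["token_count"]
    return {
        "input_tokens": usage.get("prompt_token_count", 0),
        "output_tokens": usage.get("response_token_count", 0),
        "input_by_modality": by_modality,
    }

# Example with made-up numbers:
sample = {
    "prompt_token_count": 120,
    "response_token_count": 45,
    "prompt_tokens_details": [
        {"modality": "TEXT", "token_count": 30},   # e.g. system prompt text
        {"modality": "AUDIO", "token_count": 90},  # e.g. user audio input
    ],
}
print(summarize_usage(sample))
```

In this sketch the TEXT and AUDIO entries together account for the 120 prompt tokens, while the 45 response tokens are a separate field, which is why looking only at prompt_tokens_details misses the output side.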
That said, I think there are problems with the token count for the Native Audio model. It's been 2 months since I reported this issue: Gemini Live API Reports Triple Prompt Token Consumption
But I haven't received any response. I hope Google can provide an answer as soon as possible.
I hope this helps.
Ciao