Hi, any news on this problem?
I ran some more tests, trying to be as accurate as possible, and I can confirm that the problem exists and that it is a serious one.
In my tests I simply waited for the model's response and then hung up immediately, without sending any audio from my side. The token count therefore applies only to the first text payload sent to the model.
If I use the countTokens endpoint (https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:countTokens?key=$GEMINI_API_KEY), the input (prompt) token count is 154. With the Live API endpoint (model gemini-2.5-flash-native-audio-preview-09-2025), however, the input (prompt) token count is 1247. Something seems amiss to me, so I think a response from Google would be appropriate, as it has been a while since my initial report. Maybe I am doing something wrong, but that is precisely why I need your help.
The payload contains a system prompt, a user prompt, and a single tool declaration. The prompts are in Italian (translated into English below for readability), but I don't think the language is the issue.
The test system prompt was:
# ROLE:
You are a telephone operator who answers users' requests. Your name is 'voxy'.
---
# TASK:
Listen to the user's requests and then respond to them.
Your first reply must be: '{{greeting}}, this is voxy. How can I help you?'
When the user indicates that they are finished and says goodbye, you must respond to their farewell and then use the 'hang_up_call' tool.
The test user prompt was:
Good afternoon
The test tool declaration was:
{
  "name": "hang_up_call",
  "description": "Hang up the call. To be used when the call ends, after the final greetings."
}
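For reference, this is roughly how I obtained the 154-token figure. It is a sketch, not my exact test harness: the `generateContentRequest` wrapper is, to my understanding, what lets countTokens include the system instruction and the tool declaration in the count, and the placeholder strings stand in for the prompts quoted above.

```python
import json
import os
import urllib.request

# Endpoint and model from my tests above.
ENDPOINT = ("https://generativelanguage.googleapis.com/v1beta/models/"
            "gemini-2.5-flash:countTokens")

def build_count_request(system_prompt: str, user_prompt: str) -> dict:
    """Assemble a countTokens body covering all three payload parts.

    The generateContentRequest wrapper is, as far as I know, required for
    the system instruction and tools to be counted, not just the contents.
    """
    return {
        "generateContentRequest": {
            "model": "models/gemini-2.5-flash",
            "systemInstruction": {"parts": [{"text": system_prompt}]},
            "contents": [{"role": "user", "parts": [{"text": user_prompt}]}],
            "tools": [{
                "functionDeclarations": [{
                    "name": "hang_up_call",
                    "description": "Hang up the call. To be used when the "
                                   "call ends, after the final greetings.",
                }]
            }],
        }
    }

def count_tokens(body: dict, api_key: str) -> int:
    """POST the body to countTokens and return the reported totalTokens."""
    req = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["totalTokens"]

body = build_count_request("<system prompt above>", "<user prompt above>")
# Only hit the API when a key is actually configured.
if os.environ.get("GEMINI_API_KEY"):
    print(count_tokens(body, os.environ["GEMINI_API_KEY"]))
```

With the real prompts substituted in, this request is what returns 154 for me.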
As you can see, it’s impossible for such a payload to be 1247 tokens.
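As a back-of-envelope check (using the crude heuristic of roughly 4 characters per token for Latin-script text, which is an assumption, not Gemini's actual tokenizer, and approximate character counts for the payload above):

```python
# Rough sanity check: estimate tokens from approximate character counts.
# The 4-chars-per-token ratio is a heuristic, not Gemini's real tokenizer.
system_prompt_chars = 330  # approximate length of the system prompt above
user_prompt_chars = 15     # the short greeting used as the user prompt
tool_decl_chars = 110      # name plus description of 'hang_up_call'

total_chars = system_prompt_chars + user_prompt_chars + tool_decl_chars
estimated_tokens = total_chars // 4
print(estimated_tokens)  # 113: the same ballpark as 154, nowhere near 1247
```

Even with generous rounding, the estimate lands near the countTokens figure of 154 and nowhere near the 1247 reported by the Live API.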
I look forward to your feedback.
Thank you for your cooperation.