Gemini Live API Reports Triple Prompt Token Consumption

For a couple of days now, the Gemini Live API has been reporting prompt token consumption roughly three times higher than the actual count. I'm seeing the same problem with both the gemini-live-2.5-flash-preview and gemini-2.5-flash-native-audio-preview-09-2025 models.
The initial prompt token count is the sum of four parts: a system prompt; the grounding documentation, which I pass as a media file; a tool, i.e. a function declared with only a name and a description; and an initial text user prompt, a greeting such as "Good morning", which I send so the model gives the user the illusion of "picking up the phone, saying hello, and introducing itself."
The grounding documentation (the media file) is a Markdown file, so it is plain text.

If I send the same payload (system prompt + media file + tool + greeting user prompt) to the token-counting endpoint https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:countTokens, the total prompt token count is 8,431.
If I instead inline the content of the media file into the user prompt (together with the greeting) rather than passing it as "media", the count increases by exactly 1, to 8,432, which is consistent with the previous result.
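For reference, here is a sketch of that cross-check. The countTokens endpoint can price a full generation payload via its generateContentRequest field; the prompt strings and the tool description below are placeholders, and an API key is assumed in the GEMINI_API_KEY environment variable:

```python
# Sketch: reproduce the countTokens baseline for the Live payload.
# Only build_count_request() runs offline; count_tokens() needs a key.
import json
import os
import urllib.request

ENDPOINT = ("https://generativelanguage.googleapis.com/v1beta/"
            "models/gemini-2.5-flash:countTokens")

def build_count_request(system_prompt: str, grounding_md: str,
                        greeting: str) -> dict:
    """Same four parts the Live session sends: system prompt, Markdown
    grounding doc, one function declaration, and the greeting turn."""
    return {
        "generateContentRequest": {
            "model": "models/gemini-2.5-flash",
            "systemInstruction": {"parts": [{"text": system_prompt}]},
            "contents": [{
                "role": "user",
                "parts": [{"text": grounding_md}, {"text": greeting}],
            }],
            "tools": [{
                "functionDeclarations": [{
                    "name": "hang_up_call",
                    "description": "Hang up the call after the final greetings.",
                }],
            }],
        }
    }

def count_tokens(payload: dict) -> int:
    """POST the payload and return totalTokens (8,431 in my tests)."""
    req = urllib.request.Request(
        f"{ENDPOINT}?key={os.environ['GEMINI_API_KEY']}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["totalTokens"]
```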

With the Gemini Live API, however, the same system prompt + media file + tool + greeting user prompt is immediately billed at roughly triple the actual token count, before I even start speaking:

  • 23,842 tokens with the gemini-live-2.5-flash-preview model
  • 24,293 tokens with the gemini-2.5-flash-native-audio-preview-09-2025 model

At the very least, I would expect both models to report the same number of tokens, even if it’s incorrect. But they don’t even agree with each other, and the difference of about 450 tokens is significant.

I noticed this because the total token consumption for each individual test used to be around 30,000 to 40,000 tokens. But yesterday, in one session, it skyrocketed to 260,000, and even today it is never less than 130,000 to 140,000.
A test session usually consists of 4 or 5 turns, so the maximum prompt token count I would expect is about 8,431 × 5 ≈ 42,000 to 45,000, certainly not more than 100,000!
I ran some tests printing the usage metadata and found that the prompt tokens are the culprit.
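Spelling out that back-of-the-envelope budget with the numbers from this post (the per-turn growth allowance is my own rough assumption for the accumulating conversation):

```python
# Rough session budget: each turn re-sends the ~8,431-token base context,
# plus some growth for the dialogue so far (300/turn is an assumed figure).
BASE_PROMPT = 8431     # countTokens result: system prompt + doc + tool + greeting
TURNS = 5
PER_TURN_GROWTH = 300  # assumed allowance for accumulated turns

expected = sum(BASE_PROMPT + i * PER_TURN_GROWTH for i in range(TURNS))
print(expected)  # 45155 -- nowhere near the 130,000+ the Live API reports
```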

I also noticed that the first sentence the model must say at the start of the session, which per the system prompt instructions is always the same, "Buongiorno, Grand Hotel Apòsa, sono Anna. Come posso aiutarla?" (Italian for "Good morning, Grand Hotel Apòsa, this is Anna. How can I help you?"), is counted differently by the two models:

  • 18 tokens with the gemini-live-2.5-flash-preview model
  • 104 tokens with the gemini-2.5-flash-native-audio-preview-09-2025 model
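As a rough plausibility check on the first figure (using the common heuristic of about four characters per token for Latin-script text; this is only an estimate, not the model's real tokenizer):

```python
# Heuristic estimate of the greeting sentence's text-token count.
sentence = "Buongiorno, Grand Hotel Apòsa, sono Anna. Come posso aiutarla?"
estimate = len(sentence) / 4  # ~4 chars/token heuristic
print(estimate)  # 15.5 -- in line with the 18 from gemini-live-2.5-flash,
                 # far below the 104 from the native-audio model
```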

But this is a less serious problem 🙂

For completeness, the language of the application is Italian.

Can you check if there are any problems with the token count for the Gemini Live API?

Thank you for your cooperation.


Hi, any news on this problem?

I ran some more tests, trying to be as accurate as possible, but I can confirm that the problem exists and it is a big problem.

In my tests I simply waited for the model to respond and then hung up immediately, without sending any audio of my own. The token count therefore applies only to the first text payload sent to the model.

If I use the token-counting endpoint (https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:countTokens?key=$GEMINI_API_KEY), the input (prompt) token count is 154. But with the Live API endpoint (model gemini-2.5-flash-native-audio-preview-09-2025), the input (prompt) token count is 1,247. Something seems amiss, so I think a response from Google would be appropriate, as it has been a while since my initial report. Maybe I'm doing something wrong, but that is precisely why I need your help.

The payload has a system prompt, a user prompt, and a single tool declaration. The prompts are in Italian, but I don't think that's the problem.

The test system prompt was:

# RUOLO:

Tu sei un operatore telefonico che risponde alle richieste degli utenti. Il tuo nome è 'voxy'.

---

# COMPITO:

Ascoltare le richieste dell'utente e poi rispondere alle richieste dell'utente.

La tua prima risposta deve essere: '{{greeting}}, sono voxy. Come posso aiutarla?'

Quando l'utente ti fa capire che ha finito e ti saluta devi rispondere al suo saluto e poi devi usare il tool 'hang_up_call'.

(English translation: "# ROLE: You are a telephone operator who answers users' requests. Your name is 'voxy'. # TASK: Listen to the user's requests and then respond to them. Your first response must be: '{{greeting}}, this is voxy. How can I help you?' When the user signals that they are done and says goodbye, you must respond to their farewell and then use the 'hang_up_call' tool.")

The test user prompt was:

Buon Pomeriggio

(English: "Good Afternoon")

The test tool declaration was:

{
  "name": "hang_up_call",
  "description": "Hang up the call. To be used when the call ends, after the final greetings."
}

As you can see, such a payload cannot possibly amount to 1,247 tokens.
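For completeness, here is how I understand the Live session carries this exact payload over the WebSocket. The field names below follow my reading of the v1beta BidiGenerateContent protocol, so treat the shape as an assumption rather than a reference; the point is that it is the same payload countTokens prices at 154 tokens:

```python
# Sketch of the Live API messages carrying the test payload above.
import json

SYSTEM_PROMPT = "..."  # the '# RUOLO / # COMPITO' prompt quoted above

# Sent once when the WebSocket opens.
setup_msg = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
        "generationConfig": {"responseModalities": ["AUDIO"]},
        "systemInstruction": {"parts": [{"text": SYSTEM_PROMPT}]},
        "tools": [{
            "functionDeclarations": [{
                "name": "hang_up_call",
                "description": "Hang up the call. To be used when the call "
                               "ends, after the final greetings.",
            }],
        }],
    }
}

# First (and, in this test, only) user turn.
client_content_msg = {
    "clientContent": {
        "turns": [{"role": "user",
                   "parts": [{"text": "Buon Pomeriggio"}]}],
        "turnComplete": True,
    }
}
```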

I look forward to your feedback.
Thank you for your cooperation.

Hello,
Do you have any updates on this issue? I also tested with the Python GenAI SDK, and the result is the same.
With the system prompt “You are a helpful AI assistant. Answer concisely.” and the prompt “Good Morning.” the prompt_token_count is 369, when it should actually be around 15.

The full usage_metadata is:

"usage_metadata": {
    "prompt_token_count": 369,
    "response_token_count": 81,
    "thoughts_token_count": 78,
    "total_token_count": 450,
    "prompt_tokens_details": [
      {
        "modality": "TEXT",
        "token_count": 369
      }
    ],
    "response_tokens_details": [
      {
        "modality": "AUDIO",
        "token_count": 81
      }
    ]
  }
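A small consistency check on that metadata (numbers copied from above; the ~15-token figure is my own estimate for such a short system prompt plus greeting):

```python
# Verify the per-modality details add up, then compute the inflation factor.
usage = {
    "prompt_token_count": 369,
    "response_token_count": 81,
    "thoughts_token_count": 78,
    "total_token_count": 450,
    "prompt_tokens_details": [{"modality": "TEXT", "token_count": 369}],
    "response_tokens_details": [{"modality": "AUDIO", "token_count": 81}],
}

prompt_detail = sum(d["token_count"] for d in usage["prompt_tokens_details"])
assert prompt_detail == usage["prompt_token_count"]  # internally consistent

inflation = usage["prompt_token_count"] / 15  # vs. the ~15 tokens expected
print(round(inflation, 1))  # 24.6 -- the whole excess sits in prompt tokens
```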

If it’s helpful, I can attach the Python code I used.

Thank you for your cooperation.

Hello,

Thank you for using the forum. We were able to reproduce the issue with the help of the instructions you provided, and we will pass this information on to the Gemini development team.
