Audio Token Counts Unexpectedly Low in Gemini Live API

## Issue

I'm using `gemini-2.0-flash-live-001` via LiveKit. A 2-minute voice conversation reports:

- Audio input: **3 tokens** (seems too low)
- Audio output: **0 tokens** (even though the agent is speaking)
- Text tokens: 13,521 input (normal, given the system prompt) and 74 output (I set audio output, so this should have been zero)

## Questions

1. Does the audio input count of 3 actually mean 3 chunks of audio rather than 3 tokens?
2. Why are there 0 audio output tokens when audio is playing?
3. Do the 74 text output tokens actually represent the audio output? Even so, that still seems too small.
4. What token counts are expected for a 2-minute voice conversation?

I need to understand this to bill users accurately.

## Code snippet

```python
# Imports inferred from the names used below
from google.genai.types import AudioTranscriptionConfig
from livekit.plugins import google

# Model setup
google.beta.realtime.RealtimeModel(
    model="gemini-2.0-flash-live-001",
    voice="Leda",
    input_audio_transcription=AudioTranscriptionConfig(),
    output_audio_transcription=AudioTranscriptionConfig(),
)

# Metrics (`session` is the agent session created elsewhere)
@session.on("metrics_collected")
def _on_metrics_collected(ev):
    inp = ev.metrics.input_token_details
    out = ev.metrics.output_token_details
    # inp.audio_tokens = 3, out.audio_tokens = 0
```

I would really appreciate it if someone could look into this and clarify things. This has been troubling me for quite a while, and I have been looking for answers in vain.
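For billing purposes, one option is to sum the per-modality counts over every `metrics_collected` event instead of reading a single event. The following is only a minimal sketch: it assumes each metrics object exposes `input_token_details` and `output_token_details` with `audio_tokens` and `text_tokens` attributes. The `UsageTotals` class and the `SimpleNamespace` stand-ins are hypothetical and not part of LiveKit.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Hypothetical running totals of tokens per modality for one session."""

    input_audio: int = 0
    input_text: int = 0
    output_audio: int = 0
    output_text: int = 0

    def add(self, metrics) -> None:
        # Attribute names are assumptions; a missing object or field counts as 0.
        inp = getattr(metrics, "input_token_details", None)
        out = getattr(metrics, "output_token_details", None)
        self.input_audio += getattr(inp, "audio_tokens", 0) or 0
        self.input_text += getattr(inp, "text_tokens", 0) or 0
        self.output_audio += getattr(out, "audio_tokens", 0) or 0
        self.output_text += getattr(out, "text_tokens", 0) or 0


# Stand-in events: the first mirrors the numbers reported in this post,
# the second simulates an event that carries no token details at all.
events = [
    SimpleNamespace(
        input_token_details=SimpleNamespace(audio_tokens=3, text_tokens=13521),
        output_token_details=SimpleNamespace(audio_tokens=0, text_tokens=74),
    ),
    SimpleNamespace(input_token_details=None, output_token_details=None),
]

totals = UsageTotals()
for metrics in events:
    totals.add(metrics)

print(totals)
```

In a real handler, `totals.add(ev.metrics)` would be called inside `_on_metrics_collected`, and the totals logged once the session ends.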

Hi @JamesIntallaga, apologies for the late response.
Thanks for the heads-up. This could be due to a faulty LiveKit integration with the `gemini-2.0-flash-live-001` model. Since this model is being deprecated on December 9th, please confirm whether the same issue occurs with any other Live model.

When using the Gemini Live API (`gemini-2.5-flash-native-audio-preview-09-2025` model) directly via WebSocket (not LiveKit), the `usageMetadata.responseTokenCount` values appear significantly lower than expected for audio output.

Environment:

  • Model: `models/gemini-2.5-flash-native-audio-preview-09-2025`
  • API: Direct WebSocket connection to Live API
  • responseModalities: `["AUDIO"]`

Observed Behavior: In a conversation with 5+ AI audio responses totaling approximately 60-90 seconds of spoken audio, I receive:

  • responseTokenCount: 380 (at 32 tokens/sec, that is only ~12 seconds of audio)
  • responseTokensDetails: `[{modality: "AUDIO", tokenCount: 380}]`

How do I know how many output tokens I am being billed for? This isn't very clear in the Google Cloud console or AI Studio (please correct me if I am wrong).
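To make the mismatch concrete, here is a rough sketch of the arithmetic together with one way to total `responseTokensDetails` per modality across the `usageMetadata` payloads received in a session. It uses the 32 tokens/sec rate quoted above, which may not be the actual rate for this model. It also assumes each `usageMetadata` covers a single turn rather than being cumulative; if the values are cumulative, summing would overcount. The helper names are hypothetical.

```python
from collections import defaultdict

# Rate quoted in this thread; treat it as an assumption for this model.
AUDIO_TOKENS_PER_SECOND = 32


def tokens_to_seconds(token_count, rate=AUDIO_TOKENS_PER_SECOND):
    """Seconds of audio implied by a token count at the given rate."""
    return token_count / rate


def seconds_to_tokens(seconds, rate=AUDIO_TOKENS_PER_SECOND):
    """Token count expected for a duration of audio at the given rate."""
    return int(seconds * rate)


def sum_response_tokens(usage_messages):
    """Total responseTokensDetails by modality over usageMetadata dicts.

    Assumes every dict reports one turn only (not a running total).
    """
    totals = defaultdict(int)
    for usage in usage_messages:
        for detail in usage.get("responseTokensDetails", []):
            totals[detail["modality"]] += detail.get("tokenCount", 0)
    return dict(totals)


# The single payload reported in this post.
reported = [
    {
        "responseTokenCount": 380,
        "responseTokensDetails": [{"modality": "AUDIO", "tokenCount": 380}],
    }
]

implied_seconds = tokens_to_seconds(380)  # 380 / 32 = 11.875
expected_low = seconds_to_tokens(60)      # 60 * 32 = 1920
expected_high = seconds_to_tokens(90)     # 90 * 32 = 2880
per_modality = sum_response_tokens(reported)

print(implied_seconds, expected_low, expected_high, per_modality)
```

Under these assumptions, 60-90 seconds of speech would correspond to roughly 1,920-2,880 audio tokens, so a reported 380 accounts for only a fraction of the audio played.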

Hi @Sanjay_Shreeyans_Jav,
Thanks for reporting this issue.
To help us debug it, could you please provide a minimal reproducible code sample?