## Issue
I'm using `gemini-2.0-flash-live-001` via LiveKit. The metrics for a 2-minute voice conversation show:
- Audio input: **3 tokens** (seems too low)
- Audio output: **0 tokens** (agent is speaking!)
- Text tokens: **13,521** input (normal, that's the system prompt), **74** output (I configured audio output, so I expected zero here)
## Questions
1. Is the audio input value of 3 actually 3 chunks of audio rather than 3 tokens?
2. Why 0 audio output tokens when audio is playing?
3. Do the 74 text output tokens actually represent the audio output? Even if so, that still seems too small.
4. What’s expected for a 2-min voice conversation?

I need to understand this to bill users accurately.
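
For context on the billing side, here is a minimal sketch of how the per-event counts could be totalled over a session. It assumes that `metrics_collected` can fire more than once per conversation and that each event reports counts for that event only, not a running total; I have not confirmed either. `TokenTotals` and `fake_metrics` are hypothetical helpers, `text_tokens` is an assumed field name alongside the `audio_tokens` field used in my snippet, and the second event's numbers are made-up placeholders.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class TokenTotals:
    """Hypothetical running totals across all metrics events in one session."""

    input_audio: int = 0
    input_text: int = 0
    output_audio: int = 0
    output_text: int = 0

    def add(self, metrics) -> None:
        # Assumption: each event carries per-event counts, not cumulative ones.
        inp = metrics.input_token_details
        out = metrics.output_token_details
        # Treat a missing or None field as zero instead of raising.
        self.input_audio += getattr(inp, "audio_tokens", 0) or 0
        self.input_text += getattr(inp, "text_tokens", 0) or 0
        self.output_audio += getattr(out, "audio_tokens", 0) or 0
        self.output_text += getattr(out, "text_tokens", 0) or 0


def fake_metrics(in_audio, in_text, out_audio, out_text):
    """Stand-in for a metrics object so the sketch runs without LiveKit."""
    return SimpleNamespace(
        input_token_details=SimpleNamespace(audio_tokens=in_audio, text_tokens=in_text),
        output_token_details=SimpleNamespace(audio_tokens=out_audio, text_tokens=out_text),
    )


totals = TokenTotals()
# First event uses the numbers from this post; the second is a placeholder.
for metrics in (fake_metrics(3, 13521, 0, 74), fake_metrics(5, 0, 0, 10)):
    totals.add(metrics)

print(totals)
```

In the real handler this would be a single `totals.add(ev.metrics)` call. If the counts turn out to be cumulative per session, summing like this would overcount, which is part of what I'm trying to find out.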
## Code snippet
```python
from google.genai.types import AudioTranscriptionConfig
from livekit.plugins import google

# Model setup
google.beta.realtime.RealtimeModel(
    model="gemini-2.0-flash-live-001",
    voice="Leda",
    input_audio_transcription=AudioTranscriptionConfig(),
    output_audio_transcription=AudioTranscriptionConfig(),
)

# Metrics (`session` is created elsewhere)
@session.on("metrics_collected")
def _on_metrics_collected(ev):
    inp = ev.metrics.input_token_details
    out = ev.metrics.output_token_details
    # inp.audio_tokens = 3, out.audio_tokens = 0
```

I'd really appreciate it if anyone could look into this and clear things up. I've been stuck on it for quite a while and haven't been able to find answers anywhere.