I’m building a real-time voice application on the Gemini Live native-audio (speech-to-speech) model and I’m trying to understand the billing behaviour, because cost does not track call duration the way I expected.
What I’m seeing
Gemini Live returns a cumulative promptTokenCount that grows on every turn within a session. When I reconcile my logged token usage against what I was actually billed, the numbers only line up if the prompt context is billed again on each turn — i.e. cost scales with the number of conversational turns, not with how long the call lasts.
To check this, I re-priced a set of calls three ways from the raw per-turn usageMetadata:
| Interpretation | Result vs. actual bill |
|---|---|
| Price only the final snapshot per call | ~8.6× too low |
| Price the single largest aggregate snapshot | ~1.6× too low |
| Sum every per-turn snapshot | matches within ~3% |
Only the “sum every turn” interpretation reproduces the actual charge. That strongly suggests the cumulative context is re-billed each turn.
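For anyone who wants to run the same check on their own logs, here is a minimal sketch of the three re-pricing interpretations. It makes several assumptions: each logged snapshot is a dict with a promptTokenCount field, only prompt tokens are priced, and a single flat rate applies. The `reprice` helper, the sample session and the 1.00 USD per million rate are hypothetical placeholders, not actual Live API pricing.

```python
from typing import Dict, List


def reprice(snapshots: List[Dict[str, int]], usd_per_million: float) -> Dict[str, float]:
    """Price one session's per-turn usage snapshots under three interpretations.

    Assumption: every snapshot carries a cumulative promptTokenCount and a
    single flat per-token rate applies (real pricing may differ by modality).
    """
    counts = [s["promptTokenCount"] for s in snapshots]
    rate = usd_per_million / 1_000_000
    return {
        "final_only": counts[-1] * rate,     # only the last snapshot is billed
        "largest_only": max(counts) * rate,  # only the largest snapshot is billed
        "sum_of_turns": sum(counts) * rate,  # every snapshot is billed in full
    }


# Hypothetical four-turn session; the token counts are illustrative only.
session = [{"promptTokenCount": n} for n in (1_000, 2_000, 3_000, 4_000)]
costs = reprice(session, usd_per_million=1.0)

for name, usd in costs.items():
    print(f"{name}: ${usd:.6f}")
```

Comparing each of the three totals against the invoiced amount for the same set of calls shows which interpretation fits the bill.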
Why this is a problem
Because cost tracks turn count rather than duration, calls of nearly identical length can cost very different amounts. Two examples from my own data (durations rounded):
- A 5.8-minute call cost ~37% less than a 5.6-minute call — the longer one was cheaper.
- Two calls of ~7.1 minutes each differed in cost by ~31%.
Sample of the pattern (turns = number of usageMetadata snapshots in the session):
| Call | Duration (min) | Turns | Final prompt tokens | Relative cost |
|---|---|---|---|---|
| A | 0.4 | 2 | ~8.6k | very low |
| B | 1.8 | 6 | ~59k | low |
| C | 5.6 | 20 | ~160k | high |
| D | 5.8 | 22 | ~222k | medium |
| E | 7.1 | 29 | ~261k | high |
| F | 3.3 | 12 | ~110k | medium |
The turn count and cumulative token growth predict cost far better than duration does.
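A rough illustration of why turn count would dominate if the cumulative context really is re-billed each turn: under the simplifying assumption that every turn adds about the same number of tokens, the summed total grows roughly with the square of the turn count. The `summed_prompt_tokens` function and the 1,000 tokens per turn figure are hypothetical and not taken from my data.

```python
def summed_prompt_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total prompt tokens if each turn is charged its full cumulative context.

    Assumption: the context grows by a constant amount per turn, so turn k
    carries k * tokens_added_per_turn prompt tokens.
    """
    return sum(k * tokens_added_per_turn for k in range(1, turns + 1))


ten_turns = summed_prompt_tokens(10, 1_000)     # 55_000 tokens
twenty_turns = summed_prompt_tokens(20, 1_000)  # 210_000 tokens

# Doubling the turn count raises the summed total by about 3.8x, not 2x.
ratio = twenty_turns / ten_turns
print(ten_turns, twenty_turns, round(ratio, 2))
```

Under that assumption, two calls of equal duration but different turn counts would be expected to differ noticeably in cost.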
Questions for the community / team
- Can anyone confirm whether Gemini Live native audio bills the cumulative prompt context on every turn? Is that the intended, documented behaviour?
- If so, what’s the recommended way to keep per-session cost predictable — e.g. context truncation, session resets, capping turns, or any billing setting I’ve missed?
- Is there official documentation that describes exactly how per-turn usageMetadata maps to billed tokens for the Live API?
I’m happy to share more anonymized per-turn usage data if it helps reproduce this. Mainly trying to understand whether this is expected behaviour and how others are managing predictability with the Live audio model.