Does Gemini Live native-audio bill cumulative prompt tokens on every turn? Cost seems to scale with turn count, not call duration.

I’m building a real-time voice application on the Gemini Live native-audio (speech-to-speech) model and I’m trying to understand the billing behaviour, because cost does not track call duration the way I expected.

What I’m seeing

Gemini Live returns a cumulative promptTokenCount that grows on every turn within a session. When I reconcile my logged token usage against what I was actually billed, the numbers only line up if the prompt context is billed again on each turn — i.e. cost scales with the number of conversational turns, not with how long the call lasts.

To check this, I re-priced a set of calls three ways from the raw per-turn usageMetadata:

Interpretation                              | Result vs. actual bill
Price only the final snapshot per call      | ~8.6× too low
Price the single largest aggregate snapshot | ~1.6× too low
Sum every per-turn snapshot                 | matches within ~3%

Only the “sum every turn” interpretation reproduces the actual charge. That strongly suggests the cumulative context is re-billed each turn.
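For anyone who wants to run the same comparison on their own logs, here is a minimal sketch of the three interpretations. The `TurnUsage` class, the `reprice` function, and the per-million-token rates are hypothetical placeholders, not real prices or SDK types; it assumes each logged usageMetadata snapshot has been reduced to a prompt token count and a response token count.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    """One logged usageMetadata snapshot (hypothetical container, not an SDK type)."""
    prompt_tokens: int    # assumed to come from promptTokenCount
    response_tokens: int  # assumed to come from the snapshot's response token count


def reprice(turns, prompt_rate_per_m, response_rate_per_m):
    """Price one session three ways; rates are placeholder costs per 1M tokens."""

    def price(snapshot):
        return (snapshot.prompt_tokens * prompt_rate_per_m
                + snapshot.response_tokens * response_rate_per_m) / 1_000_000

    # "Largest aggregate snapshot" is taken here as the one with the most total tokens.
    largest = max(turns, key=lambda t: t.prompt_tokens + t.response_tokens)
    return {
        "final_snapshot": price(turns[-1]),
        "largest_snapshot": price(largest),
        "sum_of_snapshots": sum(price(t) for t in turns),
    }


# Made-up session with a cumulative prompt count that grows every turn.
session = [TurnUsage(1000, 100), TurnUsage(2500, 100), TurnUsage(4000, 100)]
estimates = reprice(session, prompt_rate_per_m=1, response_rate_per_m=2)
```

Comparing each of the three estimates against the invoiced amount per call is what produced the ratios in the table above.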

Why this is a problem

Because cost tracks turn count rather than duration, calls of nearly identical length can cost very different amounts. Two examples from my own data (durations rounded):

  • A 5.8-minute call cost ~37% less than a 5.6-minute call, so the longer call was the cheaper one.
  • Two calls of ~7.1 minutes each differed in cost by ~31%.

Sample of the pattern (turns = number of usageMetadata snapshots in the session):

Call | Duration (min) | Turns | Final prompt tokens | Relative cost
A    | 0.4            | 2     | ~8.6k               | very low
B    | 1.8            | 6     | ~59k                | low
C    | 5.6            | 20    | ~160k               | high
D    | 5.8            | 22    | ~222k               | medium
E    | 7.1            | 29    | ~261k               | high
F    | 3.3            | 12    | ~110k               | medium

The turn count and cumulative token growth predict cost far better than duration does.
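A toy model shows why turn count would dominate if every cumulative snapshot is billed. It assumes, purely for illustration, that the context grows by an equal amount each turn until it reaches a fixed final size; real sessions will not be this regular, and `summed_prompt_tokens` is a hypothetical helper.

```python
def summed_prompt_tokens(final_context_tokens: int, turns: int) -> float:
    """Total prompt tokens billed if every cumulative snapshot is charged.

    Assumes the context grows by the same amount each turn, so snapshot i
    holds i/turns of the final context. This is a simplification.
    """
    per_turn = final_context_tokens / turns
    return sum(per_turn * i for i in range(1, turns + 1))


# Same final context (think: same call length), different number of turns.
few_turns = summed_prompt_tokens(120_000, turns=10)
many_turns = summed_prompt_tokens(120_000, turns=20)
```

Under this assumption the billed total is final_context_tokens × (turns + 1) / 2, so for a fixed amount of conversation, doubling the number of turns nearly doubles the billed prompt tokens.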

Questions for the community / team

  1. Can anyone confirm whether Gemini Live native audio bills the cumulative prompt context on every turn? Is that the intended, documented behaviour?
  2. If so, what’s the recommended way to keep per-session cost predictable — e.g. context truncation, session resets, capping turns, or any billing setting I’ve missed?
  3. Is there official documentation that describes exactly how per-turn usageMetadata maps to billed tokens for the Live API?

I’m happy to share more anonymized per-turn usage data if it helps reproduce this. Mainly trying to understand whether this is expected behaviour and how others are managing predictability with the Live audio model.