Hi,
We’re building a production app on the Gemini Live API (gemini-3.1-flash-live-preview, persistent WebSocket, audio S2S) and need billing clarification:
- The pricing page says you’re charged per turn for all tokens in the context window. Does this re-billing apply to audio tokens from previous turns, or only text (transcriptions, system prompt)? At what rate?
- Does the per-minute price ($0.005/min input, $0.018/min output) already include context re-processing, or is context re-billing charged on top?
- With contextWindowCompression (trigger: 25k, sliding window: 8k) - after compression, are subsequent turns billed for ~8k context tokens, or the full pre-compression amount?
- With inputAudioTranscription and outputAudioTranscription enabled - does the context window store raw audio tokens from previous turns, or only text transcriptions?
- When inputAudioTranscription and outputAudioTranscription are enabled - are we billed for both the audio tokens and the transcription text tokens? Or is transcription included in the audio token price?
Thanks
Hello,
any info regarding my questions?
Cheers
Hi Damian!
Apologies for the delayed response. I have updated the Live API best practices to include pricing and billing. I would love your feedback on whether the documentation updates make things clearer for you: https://ai.google.dev/gemini-api/docs/live-api/best-practices#pricing-billing
I'll also try to cover your questions below, in the same order:
- Yes, re-billing applies to all tokens, including raw audio from previous turns. Because Gemini is natively multimodal, it doesn’t convert past audio to text to save space. Instead, it retains the original audio tokens to preserve tone, emotion, and conversational nuance. You are billed for these accumulated audio tokens at the standard audio input token rate on every single turn.
- There is no flat per-minute price for the Live API. Gemini bills strictly by the token. Because tokens accumulate in the persistent WebSocket session, the cost scales with the length of the conversation: a 10-second interaction at the end of a long session costs significantly more than a 10-second interaction at the start.
- Subsequent turns are billed only for the compressed amount (~8k tokens) plus the new tokens for that turn. Once the sliding window compresses the history, older tokens are evicted from the active context. This stops the compounding re-billing cost for the evicted tokens.
- The context window stores raw audio tokens, not the text transcriptions. Transcriptions are generated purely as a separate payload for the frontend application (e.g., for UI logs or closed captions); they do not replace the audio tokens in the model’s memory.
- You are billed for both audio tokens and transcription text tokens. Transcription is not included in the base audio token price. When those flags are enabled, the API runs a parallel process to generate text. As the billing rule states, “all text tokens generated for transcription are charged at the text token output rate” on top of the audio token costs.
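To make the re-billing arithmetic above concrete, here is a rough sketch of how the billed context grows per turn, with and without a sliding window. It is only an illustration under stated assumptions: the function name and the token counts are made up, every turn is assumed to add the same number of tokens, rates are ignored (it counts tokens, not dollars), and compression is modeled as simply cutting the history down to the target once the context exceeds the trigger.

```python
def billed_context_per_turn(new_tokens_per_turn, trigger_tokens=None, target_tokens=None):
    """Return the context size billed on each turn (history + that turn's new tokens).

    Hypothetical model: if a trigger is set and the accumulated history exceeds it,
    the history is cut down to target_tokens before the next turn is processed.
    """
    context = 0
    billed = []
    for new_tokens in new_tokens_per_turn:
        if trigger_tokens is not None and context > trigger_tokens:
            # Sliding window: older tokens are evicted from the active context.
            context = min(context, target_tokens)
        context += new_tokens
        billed.append(context)
    return billed


# Illustrative numbers only: ten turns that each add 1,000 tokens.
turns = [1000] * 10
no_compression = billed_context_per_turn(turns)
with_compression = billed_context_per_turn(turns, trigger_tokens=5000, target_tokens=2000)

print(sum(no_compression))    # 55000 tokens billed across the session
print(sum(with_compression))  # 39000 tokens billed across the session
```

In this toy model the last turn alone is billed for 10,000 tokens without compression but 6,000 with it, which is the "late turns cost more" effect described above.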
Please let me know if this helps or if you have more questions - I’m happy to help debug.
@Alisa_Fortin
Thanks, this helped a lot.
We ran a few live tests with context_window_compression and found an important difference between text input and real audio input.
For realtimeInput.text, low thresholds work as expected:
trigger_tokens=6000, target_tokens=2000
- promptTokenCount dropped from ~5k to ~2k
trigger_tokens=4000, target_tokens=2000
- also worked
But with real S2S audio input (realtimeInput.audio, PCM 16 kHz), we could not reproduce the same behavior:
trigger_tokens=6000, target_tokens=2500
- prompt grew past 6k and reached ~8.2k
- no drop observed
trigger_tokens=4000, target_tokens=2000
- prompt grew past 4k and reached ~5.2k
- no drop observed
trigger_tokens=12000, target_tokens=8000
- prompt grew past 12k and reached ~16.2k
- no drop observed
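For anyone who wants to repeat this check, below is a minimal sketch of how a run can be classified from the logged promptTokenCount values. The helper name is hypothetical, and the sample sequences are illustrative values shaped like the runs above, not our actual logs; only the ~5k to ~2k drop and the ~8.2k peak come from the tests.

```python
def summarize_run(prompt_token_counts, trigger_tokens):
    """Summarize one session from its per-turn promptTokenCount values.

    A "drop" is any turn whose count is lower than the previous turn's count,
    which is what we would expect to see when compression kicks in.
    """
    counts = list(prompt_token_counts)
    drops = [(before, after) for before, after in zip(counts, counts[1:]) if after < before]
    peak = max(counts) if counts else 0
    return {
        "peak": peak,
        "drops": drops,
        # True when the prompt grew past the trigger and never shrank.
        "exceeded_trigger_without_drop": peak > trigger_tokens and not drops,
    }


# Illustrative sequences (intermediate values are assumptions).
text_run = summarize_run([1800, 3400, 5000, 2000], trigger_tokens=6000)
audio_run = summarize_run([2500, 4800, 6900, 8200], trigger_tokens=6000)

print(text_run["drops"])                           # [(5000, 2000)]
print(audio_run["exceeded_trigger_without_drop"])  # True
```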
This is important for production S2S apps. Because previous raw audio is re-billed on every turn, we sometimes need to keep the active context very low; otherwise costs grow too quickly. Today we handle this manually: once the prompt reaches about 6k tokens, we close the Live session, summarize the conversation, and start a new session.
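Roughly, the manual workaround looks like the sketch below. This is a simplified illustration, not our production code: `summarize`, `close_session`, and `start_session` are hypothetical callables standing in for whatever your app uses, and the 6,000-token limit is just the threshold we happen to use.

```python
def maybe_rotate(prompt_token_count, transcript, summarize, close_session,
                 start_session, limit_tokens=6000):
    """Rotate the Live session once the prompt reaches limit_tokens.

    transcript is a list of text turns kept by the app. On rotation it is
    replaced by a single summary entry, which also seeds the new session.
    Returns True if a rotation happened.
    """
    if prompt_token_count < limit_tokens:
        return False
    summary = summarize(transcript)   # hypothetical: e.g. a separate text-model call
    close_session()                   # hypothetical: closes the current WebSocket
    start_session(summary)            # hypothetical: opens a new session with the summary
    transcript[:] = [summary]
    return True


# Demo with stand-in callables.
log = []
history = ["user: hello", "model: hi, how can I help?"]
rotated = maybe_rotate(
    6100,
    history,
    summarize=lambda turns: f"summary of {len(turns)} turns",
    close_session=lambda: log.append("closed"),
    start_session=lambda seed: log.append(f"started with: {seed}"),
)
print(rotated, history, log)
```

The downside of this approach is that the new session loses the raw audio history, so tone and nuance from earlier turns survive only as far as the summary captures them.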
Questions:
- Is context_window_compression expected to work for accumulated raw audio tokens in Live API sessions, or only reliably for text context?
- Are there undocumented minimum values for trigger_tokens / target_tokens when the active context contains audio? For example, should 6000 -> 2500 be expected to work?
- If this is a bug or limitation, is there a planned fix? It would be very helpful to know whether we can eventually rely on automatic compression for real audio S2S sessions.
Feature requests:
- Audio context caching, if technically possible for Live API. Re-billing previous raw audio on every turn is the main cost driver.
- An option to store previous turns as text transcripts instead of raw audio, or to disable retaining previous raw audio entirely. We understand this may reduce output quality, tone awareness, and conversational nuance, but for many production use cases the cost control tradeoff would be worth it.
In short: for real production S2S we need separate context-management strategies for text and audio. Text compression seems to work at low thresholds, but audio context either behaves differently or does not compress at those thresholds.
Damian