How does the Gemini Realtime API handle billing for audio input reused in conversation history, and how do cached tokens work in this context?

Bill_Armstrong · October 6, 2025, 2:26am

Hello,

I’m using the Gemini Realtime / Live API with audio input. I understand that billing is based on input tokens, output tokens, and cached tokens. I have two related questions:

Audio input reused in conversation history

Suppose I say “Hello” via audio in the first turn. That counts as audio input billing.

Later, I say “How’s the weather today?” via audio in the second turn. At this point, the conversation history includes the first “Hello” as context.

Does the first “Hello” continue to be counted as audio tokens in subsequent turns, or is it internally converted to text tokens and reused as cached tokens (so that it is not billed again as audio input)?
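
To make the question concrete, here is a rough sketch of the two billing models I have in mind. Everything in it is an assumption for illustration: the 32 tokens per second audio rate, the turn durations, and the function names are hypothetical, and I don't know which model (if either) matches how the Live API actually bills.

```python
# Illustrative only: the audio token rate and turn durations are assumptions,
# not confirmed Live API billing behaviour.
TOKENS_PER_AUDIO_SECOND = 32  # assumed rate, for illustration


def audio_tokens(seconds: float) -> int:
    """Convert an audio duration to a token count at the assumed rate."""
    return round(seconds * TOKENS_PER_AUDIO_SECOND)


def billed_if_history_rebilled(turn_seconds: list[float]) -> int:
    """Hypothesis A: each turn is billed for all audio in the context so far."""
    total = 0
    history = 0
    for seconds in turn_seconds:
        history += audio_tokens(seconds)  # context grows with every turn
        total += history                  # the whole context is billed again
    return total


def billed_if_only_new_audio(turn_seconds: list[float]) -> int:
    """Hypothesis B: each turn is billed only for its newly sent audio."""
    return sum(audio_tokens(seconds) for seconds in turn_seconds)


# Hypothetical durations: "Hello" (1 s), "How's the weather today?" (2 s)
turns = [1.0, 2.0]
print(billed_if_history_rebilled(turns))  # 32 + (32 + 64) = 128
print(billed_if_only_new_audio(turns))    # 32 + 64 = 96
```

Under hypothesis A the billed total grows roughly quadratically with the number of turns, which is why I care about whether earlier audio is re-billed, converted to text, or served from cache.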

Cached tokens in Realtime API

In the above scenario, it seems no cached tokens are generated (since each turn is just fresh audio input). Could you clarify:

How exactly cached tokens are applied in the Realtime API context?

Are cached tokens relevant for conversation history reuse, or only for repeated prompts in stateless calls?

Is there a recommended way to leverage cached tokens in Realtime sessions to reduce cost, especially when conversation history grows long?

Thanks in advance for clarification!
