How does the Gemini Realtime API handle billing for audio input reused in conversation history, and how do cached tokens work in this context?

Hello,

I’m using the Gemini Realtime / Live API with audio input. I understand that billing is based on input tokens, output tokens, and cached tokens. I have two related questions:

Audio input reused in conversation history

Suppose I say “Hello” via audio in the first turn. That is billed as audio input.

Later, I say “How’s the weather today?” via audio in the second turn.
At this point, the conversation history includes the first “Hello” as context.

👉 Does the first “Hello” continue to be counted as audio tokens in subsequent turns, or is it internally converted to text tokens and reused as cached tokens (so that it is not billed again as audio input)?
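To make the two possibilities concrete, here is a rough sketch of what I am trying to distinguish. Everything in it is an assumption for illustration: the function name `billed_input_tokens`, the two token rates, and the turn durations are placeholders I made up, not values from any pricing page, and the sketch ignores the model's own replies in the history.

```python
# Hypothetical numbers for illustration only; not taken from official pricing.
AUDIO_TOKENS_PER_SECOND = 32  # assumed tokenization rate for audio input
TEXT_TOKENS_PER_SECOND = 3    # assumed transcript size for one second of speech


def billed_input_tokens(turn_seconds, history_as_audio):
    """Return one (fresh_audio_tokens, history_tokens) tuple per turn.

    history_as_audio=True  models "earlier audio is re-counted as audio".
    history_as_audio=False models "earlier audio is carried forward as text".
    """
    results = []
    history_seconds = 0.0
    for seconds in turn_seconds:
        # The new utterance is always counted as audio.
        fresh = round(seconds * AUDIO_TOKENS_PER_SECOND)
        # Earlier utterances are counted at whichever rate the model assumes.
        rate = AUDIO_TOKENS_PER_SECOND if history_as_audio else TEXT_TOKENS_PER_SECOND
        history = round(history_seconds * rate)
        results.append((fresh, history))
        history_seconds += seconds
    return results


turns = [1.0, 2.0]  # "Hello", then "How's the weather today?" (durations assumed)
print(billed_input_tokens(turns, history_as_audio=True))   # [(32, 0), (64, 32)]
print(billed_input_tokens(turns, history_as_audio=False))  # [(32, 0), (64, 3)]
```

Under these made-up rates, the second turn carries 32 history tokens if the first “Hello” is still treated as audio, versus 3 if it is treated as text. I would like to know which of these (if either) matches the actual behavior.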

Cached tokens in Realtime API
In the above scenario, it seems no cached tokens are generated (since each turn is just fresh audio input). Could you clarify:

How exactly are cached tokens applied in the Realtime API context?

Are cached tokens relevant for conversation history reuse, or only for repeated prompts in stateless calls?

Is there a recommended way to leverage cached tokens in Realtime sessions to reduce cost, especially when conversation history grows long?

Thanks in advance for any clarification!