Understanding Gemini Multimodal Live context and pricing

Audio input is listed at $0.70 per million tokens, versus $0.10 per million tokens for text. How does this work in practice with the context window? Is only the current audio input priced at $0.70 per million tokens, with the rest of the context (held as text) priced at $0.10 per million tokens, or is the entire context priced at $0.70 per million tokens?
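
To make the two readings concrete, here is a minimal sketch of the arithmetic. The rates are the ones quoted above; the token counts and function names are hypothetical, chosen only for illustration, and neither reading is confirmed as the actual billing behavior.

```python
# Rates quoted in the question, in USD per million input tokens.
AUDIO_RATE = 0.70
TEXT_RATE = 0.10


def cost_split(audio_tokens, text_tokens):
    """Reading A: audio tokens billed at the audio rate, text tokens at the text rate."""
    return (audio_tokens * AUDIO_RATE + text_tokens * TEXT_RATE) / 1_000_000


def cost_all_audio(audio_tokens, text_tokens):
    """Reading B: the entire context billed at the audio rate."""
    return (audio_tokens + text_tokens) * AUDIO_RATE / 1_000_000


# Hypothetical turn: 2,000 tokens of new audio on top of 30,000 tokens of text context.
audio_tokens, text_tokens = 2_000, 30_000
print(f"Reading A (split rates): ${cost_split(audio_tokens, text_tokens):.4f}")
print(f"Reading B (all audio):   ${cost_all_audio(audio_tokens, text_tokens):.4f}")
```

With these made-up numbers, Reading B costs roughly five times as much per turn as Reading A, so the difference matters for long sessions.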