Gemini 2.5 TTS workflow questions

Looking at pricing and quota for 2.5 Flash/Pro TTS, I’m trying to figure out the best way to use it.

As I understand it, TTS audio output is counted at 32 tokens per second of audio, so 1,920 tokens per minute. The maximum output for both TTS models is 16,000 tokens, which is about 8.33 minutes. So if I estimate that my input text will produce under 8 minutes of audio (and I should probably err on the conservative side), I can do it all in one turn. Anything longer, I should break into a few chunks.
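For anyone checking the arithmetic, here is a minimal Python sketch of the budget calculation. The two constants are just the figures quoted in this post (32 tokens per second, 16,000 max output tokens), not values confirmed against the docs, and the function names are made up for illustration.

```python
import math

# Figures quoted in this thread; treat them as assumptions, not confirmed limits.
TOKENS_PER_SECOND = 32
MAX_OUTPUT_TOKENS = 16_000


def max_audio_seconds(max_tokens=MAX_OUTPUT_TOKENS, tokens_per_second=TOKENS_PER_SECOND):
    """Longest audio duration (in seconds) that fits in the output token budget."""
    return max_tokens / tokens_per_second


def audio_tokens(seconds, tokens_per_second=TOKENS_PER_SECOND):
    """Estimated output tokens for a clip of the given length, rounded up."""
    return math.ceil(seconds * tokens_per_second)


if __name__ == "__main__":
    limit = max_audio_seconds()
    print(f"Budget: {limit:.0f} s, or {limit / 60:.2f} min")
    print(f"One minute of audio: {audio_tokens(60)} tokens")
```

Under those assumed figures the budget works out to 500 seconds, which is where the 8.33 minutes comes from.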

Main question: is the 32 tokens per second figure for audio output correct? And is there anything I am missing in this simple workflow?

Thank you for your inquiry. Your understanding is mostly correct.

For audio longer than about 8.33 minutes, I recommend breaking the input text into smaller segments so that each request stays within the output token limit. Make sure each segment is coherent and keeps enough context for the generated speech to flow naturally.
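One possible way to do the splitting is sketched below in Python. It is only an illustration, not an official method: `split_for_tts` is a hypothetical helper, the 150 words-per-minute speaking rate is an assumed estimate that should be calibrated against real output, and the 400-second default is simply a conservative margin under the roughly 500-second budget discussed in this thread. It splits on sentence boundaries so each segment stays coherent.

```python
import re


def split_for_tts(text, max_seconds=400.0, words_per_minute=150.0):
    """Group whole sentences into chunks whose estimated speech time fits max_seconds.

    The speaking rate is an assumption; measure real output and adjust it.
    """
    # Convert the time budget into a word budget using the assumed rate.
    max_words = int(max_seconds / 60.0 * words_per_minute)

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "One two three. Four five six. Seven eight nine."
    # Tiny budget (6 words) just to show the grouping behaviour.
    for chunk in split_for_tts(sample, max_seconds=60, words_per_minute=6):
        print(repr(chunk))
```

Note that a single sentence longer than the word budget still ends up as its own oversized chunk, so very long sentences would need a further split. Each chunk would then be sent as a separate TTS request and the resulting audio concatenated.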

For more detailed information on pricing and quotas, please refer to the official Gemini Pricing Documentation.

Thanks for using the AI forum :slight_smile: