Gemini 2.5 TTS workflow questions

William_Kelley · June 6, 2025, 7:05am

Looking at pricing and quota for 2.5 Flash/Pro TTS, I’m trying to figure out the best way to use it.

As I understand it, TTS audio output is consumed at 32 tokens per second of audio, so 1,920 tokens per minute. Max output for both TTS models is 16,000 tokens, which works out to about 8.33 minutes (16,000 / 1,920). So if I estimate that my input text will produce under 8 minutes of audio (and I should probably estimate conservatively), I can do it all in one turn. Anything over that, I should break into a few chunks.
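To make that arithmetic easy to re-check, here is a minimal Python sketch. Both constants are just the figures quoted in this post (32 tokens per second, 16,000-token output cap); they are my working assumptions, not confirmed values, and the function names are made up for illustration.

```python
# Figures as quoted in the post above; treat both as unverified assumptions.
TOKENS_PER_SECOND = 32
MAX_OUTPUT_TOKENS = 16_000


def audio_tokens(seconds, rate=TOKENS_PER_SECOND):
    """Estimated output tokens for a given audio duration in seconds."""
    return seconds * rate


def max_audio_seconds(max_tokens=MAX_OUTPUT_TOKENS, rate=TOKENS_PER_SECOND):
    """Longest audio duration (seconds) that fits in one response."""
    return max_tokens / rate


print(audio_tokens(60))          # tokens for one minute of audio
print(max_audio_seconds() / 60)  # maximum minutes of audio per turn
```

If either figure turns out to be different, only the two constants need to change.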

Main question: is the 32 tokens per second figure for audio output correct? And is there anything I am missing in this simple workflow?

Krish_Varnakavi1 · June 6, 2025, 9:54pm

Thank you for your inquiry. Your understanding is mostly correct.

For audio longer than 8.33 minutes, I recommend breaking the input text into smaller segments to stay within the token limit. Make sure each segment is coherent and preserves context, so that the generated speech flows naturally.
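As a rough illustration of that segmentation, the sketch below splits text at sentence boundaries so each chunk stays under an estimated word budget. The 32 tokens per second and 16,000-token figures come from the question; the speaking rate of 150 words per minute and the 20% safety margin are assumptions chosen only for the example, and `chunk_text` is a hypothetical helper, not part of any Gemini SDK.

```python
import re

TOKENS_PER_SECOND = 32    # figure from the question, not confirmed here
MAX_OUTPUT_TOKENS = 16_000
WORDS_PER_MINUTE = 150    # assumed average speaking rate
SAFETY_MARGIN = 0.8       # assumed: use only 80% of the theoretical budget


def max_words_per_chunk():
    """Estimated number of words whose audio fits in one response."""
    seconds = MAX_OUTPUT_TOKENS / TOKENS_PER_SECOND * SAFETY_MARGIN
    return round(seconds * WORDS_PER_MINUTE / 60)


def chunk_text(text, max_words=None):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own
    chunk, so it may still exceed the budget.
    """
    if max_words is None:
        max_words = max_words_per_chunk()
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk would then be sent as its own TTS request. Splitting on sentence boundaries is only the simplest way to keep segments coherent; splitting on paragraphs may give more natural pauses.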

For more detailed information on pricing and quotas, please refer to the official Gemini Pricing Documentation.

Thanks for using the AI forum.
