Summary: On gemini-3.1-flash-tts-preview, the SSE streaming endpoint (:streamGenerateContent?alt=sse) intermittently returns partial audio with finishReason: OTHER (still HTTP 200) once the generation exceeds ~60s of audio. The exact same prompt through non-streaming :generateContent returns the full audio with finishReason: STOP every time. The truncated responses are still billed as AUDIO tokens even though the output is unusable, and no error is surfaced to the client.
Repro (raw REST, single-speaker, fr-FR voice “Leda”). Same request body, only the endpoint differs:
| Text length | Approx. audio duration | :streamGenerateContent finishReason (3 trials) | :generateContent finishReason |
|---|---|---|---|
| 50 words | ~20s | STOP / STOP / STOP | STOP |
| 100 words | ~40s | STOP / STOP / STOP | STOP |
| 150 words | ~57s | STOP / STOP / STOP | STOP |
| 200 words | ~70s | OTHER / STOP / STOP | STOP |
| 250 words | ~89s | OTHER / STOP / STOP | STOP |
| 300 words | ~106s | OTHER / OTHER / OTHER | STOP |
| 350+ words | ~125s | OTHER / OTHER / OTHER | STOP (full ~136s) |
Streaming starts truncating once the audio passes ~60-70s: 1 of 3 trials fails at ~70s and ~89s, and every trial fails from ~106s up. Non-streaming has no such cliff. A failed stream delivers one or a few PCM chunks, then finishReason: OTHER, still with HTTP 200.
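Because the failure arrives as HTTP 200, the only client-side signal is the finishReason in the stream itself. Here is a minimal sketch of how we detect it, not an official client. It assumes each SSE `data:` line carries a JSON chunk with `candidates[].content.parts[].inlineData.data` (base64 audio) and an optional `candidates[].finishReason`, and that the audio is 24 kHz, 16-bit mono PCM. The function names are my own.

```python
import base64
import json

# Assumptions (not confirmed for this preview model): 24 kHz, 16-bit mono PCM.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2


def summarize_sse(lines):
    """Collect decoded audio bytes and the last finishReason from SSE lines."""
    pcm = bytearray()
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        for cand in event.get("candidates", []):
            for part in cand.get("content", {}).get("parts", []):
                data = part.get("inlineData", {}).get("data")
                if data:
                    pcm += base64.b64decode(data)
            # Keep the most recent finishReason seen in the stream.
            finish_reason = cand.get("finishReason", finish_reason)
    seconds = len(pcm) / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    return bytes(pcm), finish_reason, seconds


def is_truncated(finish_reason):
    """Treat anything other than STOP (including a missing reason) as a failure."""
    return finish_reason != "STOP"
```

Under the same PCM assumptions, 1,280,000 base64 characters decode to 960,000 bytes, which is exactly 20s of audio. That matches the figure in the first related report below.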
The confusing part: the Gemini API TTS docs state “TTS does not support streaming” under Limitations, yet :streamGenerateContent accepts the request, returns 200, and bills AUDIO tokens, just with truncated output. What is the supported production path for long-form streaming TTS?
Environment: model gemini-3.1-flash-tts-preview; reproduced on both Vertex AI (generateContentStream) and the Gemini Developer API (streamGenerateContent?alt=sse); single-speaker; reproduces both with temperature omitted and with temperature 0.6.
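For anyone trying to reproduce on the Developer API, here is a rough sketch of the two calls being compared. It is not my exact payload: the body follows the single-speaker request shape from the public Gemini TTS docs (`responseModalities`, `speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName`), and the helper names and `v1beta` base URL are assumptions for illustration. No language code is set here.

```python
# Assumed base URL for the Gemini Developer API (v1beta).
BASE = "https://generativelanguage.googleapis.com/v1beta/models"
MODEL = "gemini-3.1-flash-tts-preview"


def build_body(text, voice="Leda"):
    """Single-speaker TTS request body, shared by both endpoints."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }


def endpoints(model=MODEL):
    """The two URLs under comparison; only the method suffix differs."""
    return {
        "streaming": f"{BASE}/{model}:streamGenerateContent?alt=sse",
        "non_streaming": f"{BASE}/{model}:generateContent",
    }
```

The same `build_body(...)` result is POSTed to both URLs, so any difference in finishReason comes from the endpoint rather than the request.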
Impact: this is a production museum audio-guide product with long-form narration. We cannot ship the streaming path. Non-streaming works, but a single ~136s generation takes ~77s of wall time, which is too slow for interactive playback. So today neither path is viable for narration longer than about 1 minute.
Related reports:
- Gemini 3.1 Flash TTS SSE sometimes returns exactly 20s / 1,280,000 base64 chars and truncated audio
- Gemini 3.1 Flash Live - Voice slowly changing, massive audio quality + volume dropping on TTS requests longer than ~1 minute
- Gemini 2.5 Flash TTS streaming?
Could the Gemini API / TTS team confirm whether streaming TTS is supported, and route the truncation issue to the right owners? Happy to share full request/response payloads and responseIds.