Hi,
I’m using Gemini TTS with the SSE streaming endpoint:
POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:streamGenerateContent?alt=sse
Request shape is roughly:
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Hebrew text prompt here..."
        }
      ]
    }
  ],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": {
          "voiceName": "Kore"
        }
      }
    }
  }
}
Most calls work correctly, but sometimes I get HTTP 200 with only a single audio chunk. The audio itself is valid, but it is truncated.
Example response, API key removed:
endpoint: streamGenerateContent?alt=sse
url: https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:streamGenerateContent?alt=sse&key=<REDACTED>
http_status: 200
Response headers:
{
  "Content-Type": "text/event-stream",
  "Content-Disposition": "attachment",
  "Vary": "Origin, X-Origin, Referer",
  "Transfer-Encoding": "chunked",
  "Date": "Thu, 14 May 2026 10:52:34 GMT",
  "Server": "scaffolding on HTTPServer2",
  "X-XSS-Protection": "0",
  "X-Frame-Options": "SAMEORIGIN",
  "X-Content-Type-Options": "nosniff",
  "Server-Timing": "gfet4t7; dur=7309",
  "Alt-Svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000"
}
SSE data:
data: {"candidates":[{"content":{"parts":[{"inlineData":{"mimeType":"audio/l16; rate=24000; channels=1","data":"<BASE64_AUDIO_REDACTED chars=1280000>"}}],"role":"model"},"index":0}],"usageMetadata":{"promptTokenCount":217,"candidatesTokenCount":640,"totalTokenCount":857,"promptTokensDetails":[{"modality":"TEXT","tokenCount":217}],"candidatesTokensDetails":[{"modality":"AUDIO","tokenCount":640}],"serviceTier":"standard"},"modelVersion":"gemini-3.1-flash-tts-preview","responseId":"a6kFauDlAozunsEPzaPjkAw"}
data: {"candidates":[{"content":{},"finishReason":"OTHER","index":0}],"usageMetadata":{"promptTokenCount":217,"candidatesTokenCount":640,"totalTokenCount":857,"promptTokensDetails":[{"modality":"TEXT","tokenCount":217}],"candidatesTokensDetails":[{"modality":"AUDIO","tokenCount":640}],"serviceTier":"standard"},"modelVersion":"gemini-3.1-flash-tts-preview","responseId":"a6kFauDlAozunsEPzaPjkAw"}
The suspicious pattern is:
inlineData.data base64 length = 1,280,000 chars
decoded PCM bytes = 960,000 (1,280,000 × 3/4)
audio format = audio/l16; rate=24000; channels=1
duration = 960,000 / (24,000 * 2) = exactly 20.00 seconds
candidatesTokenCount = 640
audio_chunks = 1
finishReason = OTHER
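The arithmetic above can be reproduced with a minimal Python sketch. The helper name pcm_duration_seconds is hypothetical, and the code assumes the payload is raw 16-bit mono PCM at 24 kHz, as the mimeType suggests:

```python
import base64


def pcm_duration_seconds(b64_audio: str, sample_rate: int = 24000,
                         channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Duration in seconds of base64-encoded raw PCM (assumed audio/l16)."""
    pcm = base64.b64decode(b64_audio)
    return len(pcm) / (sample_rate * channels * bytes_per_sample)


# Synthetic stand-in for the redacted audio: 960,000 zero bytes encode to
# 1,280,000 base64 characters, the same length as the truncated response.
fake_audio = base64.b64encode(bytes(960_000)).decode("ascii")
print(len(fake_audio))                   # 1280000
print(pcm_duration_seconds(fake_audio))  # 20.0
```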
When this happens and there is no second inlineData.data event after it, the generated audio is usually cut off around 20 seconds. If I retry the exact same request, I often get a longer response with two audio chunks and finishReason: STOP, for example around 29–31 seconds, and the full text is spoken.
Questions:
- Is 1,280,000 base64 chars / 20.00 seconds / 640 audio tokens a known internal chunk limit or partial-response behavior for Gemini TTS SSE?
- Is finishReason: OTHER expected for successful TTS streams, or should it be treated as a retry signal when the response has only one 20-second chunk?
- Is there a recommended way to detect this reliably without running speech-to-text verification on the returned audio?
- Should the client retry automatically when it sees this exact pattern?
Currently I treat this as suspicious and verify the audio transcript before accepting it, but I’d like to know whether there is an official signal or recommended client behavior for this case.
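A minimal Python sketch of what a client-side check for this pattern could look like. The function names summarize_tts_stream and looks_truncated are hypothetical, the code assumes 16-bit mono PCM at 24 kHz, and treating any finishReason other than STOP as a retry signal is only a guess based on the responses above, not documented behavior:

```python
import base64
import json


def summarize_tts_stream(sse_text: str, sample_rate: int = 24000) -> dict:
    """Count audio chunks, decoded PCM bytes, and the last finishReason seen."""
    audio_chunks = 0
    pcm_bytes = 0
    finish_reason = None
    for raw_line in sse_text.split("\n"):
        line = raw_line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        event = json.loads(line[len("data:"):].strip())
        for cand in event.get("candidates", []):
            finish_reason = cand.get("finishReason", finish_reason)
            for part in cand.get("content", {}).get("parts", []):
                data = part.get("inlineData", {}).get("data")
                if data:
                    audio_chunks += 1
                    pcm_bytes += len(base64.b64decode(data))
    return {
        "audio_chunks": audio_chunks,
        "pcm_bytes": pcm_bytes,
        # Assumption: 2 bytes per sample, 1 channel.
        "duration_seconds": pcm_bytes / (sample_rate * 2),
        "finish_reason": finish_reason,
    }


def looks_truncated(summary: dict) -> bool:
    """Heuristic (assumption): anything other than STOP is suspicious."""
    return summary["finish_reason"] != "STOP"


# Synthetic two-event stream shaped like the truncated response (1 s of silence).
chunk = base64.b64encode(bytes(48_000)).decode("ascii")
demo = "\n\n".join([
    "data: " + json.dumps({"candidates": [{"content": {"parts": [
        {"inlineData": {"mimeType": "audio/l16; rate=24000; channels=1",
                        "data": chunk}}]}, "index": 0}]}),
    "data: " + json.dumps({"candidates": [
        {"content": {}, "finishReason": "OTHER", "index": 0}]}),
])
summary = summarize_tts_stream(demo)
print(summary["audio_chunks"], summary["duration_seconds"], looks_truncated(summary))
# 1 1.0 True
```

A check like this only flags the response for a retry. It cannot confirm that a stream ending in STOP contains the full text, so it would complement transcript verification rather than replace it.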
Thanks!