We’ve been running Gemini TTS multi-speaker mode in production for our podcast SaaS platform for three weeks now and have documented seven reproducible issues. We’ve deployed 20+ workarounds on our end and logged 34 separate incidents. Sharing here in case others are hitting the same problems, and hoping the Gemini team can confirm whether these are known and being tracked.
Models tested: gemini-2.5-flash-tts, gemini-2.5-pro-tts, gemini-3.1-pro-preview
SDK: @google/genai (TypeScript), on both the Developer API and Vertex AI
1. Non-deterministic audio truncation (critical)
The API returns finishReason: 'OTHER' instead of 'STOP' and stops generating audio mid-stream. We see output containing only 13-46% of expected duration on chunks well under the 2000-character limit.
Example: 1078-char input, expected ~72s output, got 23.99s (33%). Same input succeeds on retry, confirming it’s non-deterministic on the API side.
We estimate 30-40% of our API calls are wasted on retries. We’ve built a 3-layer detection system (finishReason check, duration ratio validation, adaptive chunk splitting) but this is a significant cost and reliability issue.
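For reference, a minimal sketch of our duration-ratio check. The constants are assumptions from our own setup: Gemini TTS output treated as 24 kHz, 16-bit mono PCM, a ~15 chars/sec speaking-rate estimate (derived from the 1078-char ≈ 72s example above), and a 0.7 ratio cutoff we tuned empirically.

```typescript
// Assumed audio format: 24 kHz, 16-bit (2-byte) mono PCM.
const SAMPLE_RATE = 24_000;
const BYTES_PER_SAMPLE = 2;
// Our empirical speaking-rate estimate, not an API guarantee.
const CHARS_PER_SECOND = 15;
// Below this expected-vs-actual ratio we treat the chunk as truncated.
const MIN_DURATION_RATIO = 0.7;

function pcmDurationSeconds(pcmBytes: number): number {
  return pcmBytes / (SAMPLE_RATE * BYTES_PER_SAMPLE);
}

function isLikelyTruncated(
  inputChars: number,
  pcmBytes: number,
  finishReason: string,
): boolean {
  // Anything other than 'STOP' (e.g. 'OTHER') goes straight to retry.
  if (finishReason !== 'STOP') return true;
  const expectedSeconds = inputChars / CHARS_PER_SECOND;
  return pcmDurationSeconds(pcmBytes) / expectedSeconds < MIN_DURATION_RATIO;
}
```

With the example above (1078 chars, 23.99s of audio), the ratio lands around 0.33 and the chunk is flagged for retry even when finishReason is 'STOP'.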
2. Safety filters silently truncate legitimate content (high)
Default safety filters block standard news podcast scripts about topics like crime, politics, and health. The response returns finishReason: 'OTHER', NOT 'SAFETY', making it indistinguishable from Issue 1.
Setting BLOCK_NONE on all 5 harm categories immediately fixes it. But because the finish reason doesn’t differentiate, we’re forced to disable all safety filters as a blanket workaround. The API should return finishReason: 'SAFETY' when content is filtered.
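For anyone hitting the same thing, here's the shape of our blanket workaround. The category strings below are the ones we pass on the wire; double-check them against the `HarmCategory` enum in your @google/genai version before copying.

```typescript
// The five harm categories we force to BLOCK_NONE. String values assumed to
// match the Gemini API's HarmCategory names; verify against your SDK version.
const HARM_CATEGORIES = [
  'HARM_CATEGORY_HARASSMENT',
  'HARM_CATEGORY_HATE_SPEECH',
  'HARM_CATEGORY_SEXUALLY_EXPLICIT',
  'HARM_CATEGORY_DANGEROUS_CONTENT',
  'HARM_CATEGORY_CIVIC_INTEGRITY',
] as const;

// Builds the safetySettings array we attach to every TTS request.
function blockNoneSafetySettings() {
  return HARM_CATEGORIES.map((category) => ({
    category,
    threshold: 'BLOCK_NONE' as const,
  }));
}
```

We'd much rather keep sensible thresholds and get a distinguishable finishReason than ship this.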
3. Multi-speaker mode hallucinates dialogue lines (high)
The model deterministically inserts backchannel lines not present in the input script (e.g., “Not as obvious, right?” at a dialogue transition). Same input, same hallucination, same position every time. The model appears to fill “conversational gaps.”
We’ve lowered temperature to 0.5 and added turn-count validation, but there doesn’t seem to be a strict/faithful rendering mode.
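The turn-count validation is roughly this: we record the expected turn count per chunk from our "Speaker: line" script format, then flag any rendered chunk whose (transcribed) turn count differs. The line-parsing regex matches our own script format, not anything Gemini-specific.

```typescript
// Counts dialogue turns in a script where each turn is a "Name: text" line.
// The pattern is specific to our script format (hypothetical example).
function countTurns(script: string): number {
  return script
    .split('\n')
    .filter((line) => /^[^:\n]+:\s*\S/.test(line.trim()))
    .length;
}

// Flags a rendered chunk whose turn count doesn't match the input script,
// i.e. a likely hallucinated (or skipped) line.
function hasUnexpectedTurns(script: string, renderedTurns: number): boolean {
  return renderedTurns !== countTurns(script);
}
```

This catches inserted backchannels like the one above, but only after paying for the generation, which is why a strict rendering mode would be so valuable.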
4. Multi-speaker voice swapping (medium)
Speaker1 occasionally gets Speaker2’s voice despite correct multiSpeakerVoiceConfig. Gets worse with longer chunks. We’ve seen this confirmed on other forum threads as well. Our workaround is keeping chunks under 3000 chars and adding voice identity reinforcement in every prompt.
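The chunking side of that workaround looks roughly like this: we only ever split at dialogue-turn boundaries so a speaker's line is never cut mid-sentence, with the 3000-char cap we landed on as the default (an empirical number, not a documented limit).

```typescript
// Packs dialogue turns into chunks of at most maxChars, splitting only at
// turn boundaries. 3000 is our empirically chosen cap, not a documented limit.
function chunkScript(turns: string[], maxChars = 3000): string[] {
  const chunks: string[] = [];
  let current = '';
  for (const turn of turns) {
    const candidate = current ? `${current}\n${turn}` : turn;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = turn;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

On top of this we prepend a short voice-identity reminder (which speaker maps to which voice) to every chunk's prompt.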
5. Multi-speaker line duplication and skipping (medium)
The model occasionally skips input lines and duplicates earlier lines in their place. Non-deterministic. We’ve confirmed no bugs in our chunking or formatting logic.
6. Short utterance distortion with context prompts (medium)
Short backchannel utterances (“Mhm”, “Right”) get distorted when sent with a full director-style context prompt. The model seems to over-process short inputs with lengthy context. Our workaround is skipping the director prompt for utterances under a character threshold.
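That workaround is a one-liner in our prompt builder. The 20-char threshold below is an assumption for illustration; we tuned ours empirically per voice.

```typescript
// Utterances under this length skip the director prompt entirely.
// 20 is a placeholder; tune empirically for your voices.
const SHORT_UTTERANCE_THRESHOLD = 20;

function buildPrompt(utterance: string, directorPrompt: string): string {
  if (utterance.trim().length < SHORT_UTTERANCE_THRESHOLD) {
    // Bare text only: avoids the over-processing we see on "Mhm", "Right".
    return utterance;
  }
  return `${directorPrompt}\n\n${utterance}`;
}
```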
7. SDK generateContentStream ignores AbortController (low)
The @google/genai SDK’s generateContentStream does not respect AbortController signals. We can’t cancel in-flight streaming requests.
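Our interim workaround, for anyone blocked on this: wrap the stream iterator and check the signal between chunks ourselves. This doesn't cancel the underlying network request (that's exactly the bug), but it does stop downstream processing promptly.

```typescript
// Wraps any async-iterable stream (e.g. the SDK's generateContentStream
// result) and bails out between chunks when the signal fires. The underlying
// request keeps running server-side; only our consumption stops.
async function* abortable<T>(
  stream: AsyncIterable<T>,
  signal: AbortSignal,
): AsyncGenerator<T> {
  for await (const chunk of stream) {
    if (signal.aborted) throw new Error('aborted'); // or signal.throwIfAborted()
    yield chunk;
  }
}
```

We'd obviously prefer the SDK honor the signal it already accepts, so the HTTP request itself gets torn down.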
Questions for the Gemini team:

- Are any of these tracked internally? Any internal issue IDs?
- Is there a strict/faithful rendering mode for multi-speaker TTS that outputs exactly the input text?
- Any ETA on finishReason returning 'SAFETY' instead of 'OTHER' for filtered content?
- Recommended maximum chunk size for multi-speaker mode to minimize voice swapping?
We’re committed to Gemini TTS. The multi-speaker mode is genuinely a differentiator for our product and we want to make it work. Happy to provide logs, audio samples, and reproduction scripts.