Gemini TTS Multi-Speaker Mode: 7 Critical Bugs After 3 Weeks in Production (finishReason 'OTHER', Truncation, Voice Swapping, Hallucinated Lines)

We’ve been running Gemini TTS multi-speaker mode in production for our podcast SaaS platform for three weeks now and have documented seven reproducible issues. We’ve deployed 20+ workarounds on our end and logged 34 separate incidents. Sharing here in case others are hitting the same problems, and hoping the Gemini team can confirm whether these are known and being tracked.

Models tested: gemini-2.5-flash-tts, gemini-2.5-pro-tts, gemini-3.1-pro-preview
SDK: @google/genai (TypeScript), both Developer API and Vertex AI


1. Non-deterministic audio truncation (critical)

The API returns finishReason: 'OTHER' instead of 'STOP' and stops generating audio mid-stream. On chunks well under the 2000-character limit, we see output containing only 13-46% of the expected duration.

Example: 1078-char input, expected ~72s output, got 23.99s (33%). Same input succeeds on retry, confirming it’s non-deterministic on the API side.

We estimate 30-40% of our API calls are wasted on retries. We’ve built a 3-layer detection system (finishReason check, duration ratio validation, adaptive chunk splitting) but this is a significant cost and reliability issue.
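A minimal sketch of the first two detection layers. The 15 chars/sec speaking-rate constant and the 0.7 ratio threshold are our own tuning values from production logs (the 1078-char / ~72s example above implies roughly 15 chars/sec), not official figures; the 24 kHz, 16-bit mono PCM format is what Gemini TTS returns.

```typescript
// Layer 1: finishReason check. Layer 2: duration ratio validation.
// (Layer 3, adaptive chunk splitting, is not shown.)
const CHARS_PER_SECOND = 15;    // rough multi-speaker speaking rate (our estimate)
const MIN_DURATION_RATIO = 0.7; // below this we treat the chunk as truncated

// Gemini TTS outputs 24 kHz, 16-bit (2-byte) mono PCM: 48000 bytes per second.
function pcmDurationSeconds(pcmBytes: number): number {
  return pcmBytes / (24000 * 2);
}

function isTruncated(
  inputChars: number,
  pcmBytes: number,
  finishReason: string
): boolean {
  if (finishReason !== "STOP") return true; // layer 1
  const expectedSeconds = inputChars / CHARS_PER_SECOND;
  return pcmDurationSeconds(pcmBytes) / expectedSeconds < MIN_DURATION_RATIO; // layer 2
}
```

With the numbers from the example above, a 1078-char chunk that comes back as 23.99s of audio lands at a ~0.33 ratio and is flagged for retry.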

2. Safety filters silently truncate legitimate content (high)

Default safety filters block standard news podcast scripts about topics like crime, politics, and health. The response returns finishReason: 'OTHER', NOT 'SAFETY', making it indistinguishable from Issue 1.

Setting BLOCK_NONE on all 5 harm categories immediately fixes it. But because the finish reason doesn’t differentiate, we’re forced to disable all safety filters as a blanket workaround. The API should return finishReason: 'SAFETY' when content is filtered.
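For reference, the blanket workaround looks like this. The string literals match the SDK's HarmCategory / HarmBlockThreshold string enums as we understand them; check your @google/genai version if they differ.

```typescript
// BLOCK_NONE on every harm category: our forced workaround for Issue 2,
// since finishReason doesn't distinguish SAFETY from OTHER.
const HARM_CATEGORIES = [
  "HARM_CATEGORY_HARASSMENT",
  "HARM_CATEGORY_HATE_SPEECH",
  "HARM_CATEGORY_SEXUALLY_EXPLICIT",
  "HARM_CATEGORY_DANGEROUS_CONTENT",
  "HARM_CATEGORY_CIVIC_INTEGRITY",
] as const;

const safetySettings = HARM_CATEGORIES.map((category) => ({
  category,
  threshold: "BLOCK_NONE" as const,
}));
// Pass as config.safetySettings alongside speechConfig in generateContent.
```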

3. Multi-speaker mode hallucinates dialogue lines (high)

The model deterministically inserts backchannel lines not present in the input script (e.g., “Not as obvious, right?” at a dialogue transition). Same input, same hallucination, same position every time. The model appears to fill “conversational gaps.”

We’ve lowered temperature to 0.5 and added turn-count validation, but there doesn’t seem to be a strict/faithful rendering mode.
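The turn-count validation is roughly this: count "Speaker: text" turns in the input script and compare against a transcription of the generated audio (the transcription step isn't shown, and the "Name:" line format is our own script convention).

```typescript
// Count dialogue turns per speaker in a script chunk. A mismatch against the
// transcribed output flags a hallucinated or skipped line (Issues 3 and 5).
function countTurns(script: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of script.split("\n")) {
    const match = line.match(/^(\w+):\s*\S/); // e.g. "Speaker1: some dialogue"
    if (match) counts.set(match[1], (counts.get(match[1]) ?? 0) + 1);
  }
  return counts;
}
```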

4. Multi-speaker voice swapping (medium)

Speaker1 occasionally gets Speaker2’s voice despite a correct multiSpeakerVoiceConfig. It gets worse with longer chunks. We’ve seen this confirmed on other forum threads as well. Our workaround is keeping chunks under 3000 chars and adding voice identity reinforcement in every prompt.
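For anyone cross-checking their config, this is the multiSpeakerVoiceConfig shape we pass (per the @google/genai TTS docs); Kore and Puck are example prebuilt voice names, and the speaker labels must match the "Speaker1:"/"Speaker2:" labels in the script.

```typescript
// Two-speaker voice assignment; the swapping in Issue 4 happens even
// when this config is correct.
const speechConfig = {
  multiSpeakerVoiceConfig: {
    speakerVoiceConfigs: [
      { speaker: "Speaker1", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } } },
      { speaker: "Speaker2", voiceConfig: { prebuiltVoiceConfig: { voiceName: "Puck" } } },
    ],
  },
};
```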

5. Multi-speaker line duplication and skipping (medium)

The model occasionally skips input lines and duplicates earlier lines in their place. Non-deterministic. We’ve confirmed no bugs in our chunking or formatting logic.

6. Short utterance distortion with context prompts (medium)

Short backchannel utterances (“Mhm”, “Right”) get distorted when sent with a full director-style context prompt. The model seems to over-process short inputs with lengthy context. Our workaround is skipping the director prompt for utterances under a character threshold.
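The workaround is a simple gate before prompt assembly. The 25-char threshold is our tuning value, not anything official.

```typescript
// Issue 6 workaround: send very short utterances bare, without the
// director-style context prompt that triggers the distortion.
const SHORT_UTTERANCE_CHARS = 25; // our threshold; tune for your scripts

function buildPrompt(utterance: string, directorPrompt: string): string {
  return utterance.trim().length < SHORT_UTTERANCE_CHARS
    ? utterance                            // bare text: avoids over-processing
    : `${directorPrompt}\n\n${utterance}`; // full context for normal lines
}
```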

7. SDK generateContentStream ignores AbortController (low)

The @google/genai SDK’s generateContentStream does not respect AbortController signals. We can’t cancel in-flight streaming requests.
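Our interim workaround is a client-side wrapper: since the SDK ignores the signal, we stop consuming the stream ourselves. The request may keep running server-side, so this is damage control, not a real cancel.

```typescript
// Wrap any async iterable (e.g. the generateContentStream result) so that
// iteration stops as soon as the signal fires.
async function* abortable<T>(
  stream: AsyncIterable<T>,
  signal: AbortSignal
): AsyncGenerator<T> {
  for await (const chunk of stream) {
    if (signal.aborted) return; // stop yielding; the consumer's loop ends
    yield chunk;
  }
}
```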


Questions for the Gemini team:

  • Are any of these tracked internally? Any internal issue IDs?

  • Is there a strict/faithful rendering mode for multi-speaker TTS that outputs exactly the input text?

  • Any ETA on finishReason returning 'SAFETY' instead of 'OTHER' for filtered content?

  • Recommended maximum chunk size for multi-speaker mode to minimize voice swapping?

We’re committed to Gemini TTS. The multi-speaker mode is genuinely a differentiator for our product and we want to make it work. Happy to provide logs, audio samples, and reproduction scripts.

We’re hitting the same exact issues on our project and trying to find a solution before our production release.

The “Non-deterministic audio truncation” issue is a truly critical bug for us. In our scenario we want users to start listening to generated content on the fly, so retrying doesn’t work for our case. By changing prompts we could reduce the error rate, but we still hit the problem occasionally and don’t know the exact cause.

It would be great to hear back about these from the Gemini team.

We found a workaround for the truncation issue, and I’m sharing it here because Google Support certainly didn’t help us find it.

The short version: switch from Vertex AI (aiplatform.googleapis.com) to the Developer API / Google AI Studio endpoint (generativelanguage.googleapis.com). Same model, same input, same config. The truncation issue is isolated to the Vertex AI infrastructure.
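A minimal sketch of the switch: same model, same generationConfig, just pointed at the Developer API endpoint. The request shape follows the public REST docs; the model name and API key here are placeholders.

```typescript
// Build a Developer API request instead of a Vertex AI one.
// GEMINI_API_KEY is an AI Studio key, not a Vertex service account.
function buildDeveloperApiRequest(model: string, text: string, apiKey: string) {
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent?key=${apiKey}`,
    body: {
      contents: [{ parts: [{ text }] }],
      generationConfig: { responseModalities: ["AUDIO"] }, // plus speechConfig as before
    },
  };
}
```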

Our test results with gemini-2.5-pro-preview-tts, identical payload:

  • Developer API (REST): 0 truncations across all chunks. finishReason: STOP every time. Full audio output.
  • Vertex AI: 7 out of 12 chunks truncated on first attempt. finishReason: OTHER. Output ranged from 16% to 85% of expected duration.

We ran this test multiple times with different scripts and the pattern is consistent. The Developer API also produces noticeably better audio quality on the same model, which suggests the serving configurations aren’t identical between the two endpoints.

Google eventually confirmed this after six days, stating that the Developer API and Vertex AI use different orchestration layers and infrastructure, and that this is related to a capacity incident from Feb 27 that left residual edge cases for multi-speaker TTS payloads.

One caveat: the Developer API has lower default rate limits than Vertex AI, so if you’re doing production volume you’ll need to request a tier upgrade through the Google AI Studio form.

Now, about the support experience, because I think the community should know what to expect:

We opened a P1 (Critical) support case on March 11 with detailed documentation of all 7 issues, production logs, and audio samples. Over the next six days we received 20+ messages from 8+ different support agents across timezone rotations. Every single message until day 6 was a generic holding template: “the product specialist team is actively working on it, expect an update by [rolling ETA].” Not one contained any technical content.

We escalated the case. The escalation manager joined a call, committed to a specialist response within 1-2 hours, and then we received the same template message three hours later. We provided a minimal reproduction script within an hour of being asked. We attached audio files, full request bodies, per-chunk truncation ratios. None of it was referenced in any response until day 6.

The workaround we’re sharing here? We found it ourselves on day 4 by running our own comparative test between the two endpoints. We shared the results with Google Support, and two days later they confirmed our findings and recommended what we had already done.

I’m not sharing this to bash the support team individually. But if you’re building production systems on Gemini TTS, you should know that P1 support for this product currently means generic delay messages for nearly a week, and you’re better off debugging it yourself and sharing your findings with them.

The remaining issues from my original post (hallucinated dialogue lines, voice swapping, line duplication, safety filter false positives) are model-level problems that affect both endpoints. No update on those yet. If anyone has found workarounds for the hallucinated backchannel lines in multi-speaker mode, I’d love to hear what’s working for you.

Hope this saves someone else a week of back-and-forth.