Hello,
I am currently evaluating Gemini TTS for generating long-form narration and I am planning to use Google Cloud for production if the results are stable.
Before subscribing to the Cloud API, I have been testing gemini-tts-2.5-pro in AI Studio for the past few days. However, I am consistently encountering several issues that make the generated audio unreliable.
The main problems I am experiencing are:
- Long blank/noise segments: The model often begins speaking normally for the first few sentences but then produces around 10 minutes of blank noise. This happens with multiple voices, but it occurs most frequently with the Fenrir voice (which I chose for my use case).
- Voice quality degradation over time: For audio around 4–5 minutes long, the voice sounds very natural at the beginning. However, as the audio progresses, the voice gradually becomes metallic or robotic and sometimes develops an echo-like effect. Background noise also appears sometimes.
- Automatic pacing changes: I try to generate narration at a slow-to-medium speaking pace. The audio usually starts at the correct speed, but as the narration continues, the speaking speed gradually increases without any instruction to do so.
Because of these issues, I have not been able to generate a complete narration without problems.
Transcript length: I tried input texts from 50 to 1000 words. So far I have not been able to generate usable audio for a script longer than 500 words, and for a 500-word script the audio shows the second and third issues (voice degradation and pacing change).
Temperature: I tried the default temperature and values as low as 0.5, but all issues persist.
Instruction prompt: I tried everything from a single-line instruction to a complex Director’s notes prompt, and also no instruction prompt at all (just the transcript). The issues persist in every case.
These issues were not as consistent before 15th March (although I had only tried short scripts until then, and they were fine). After 15th March they suddenly became worse: if I generate 10 audio clips, only one is acceptable, and even that one involves compromises. I try to keep the script short so the audio doesn’t exceed 5 minutes.
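For anyone who wants to reproduce this with scripts of a controlled size, here is a minimal sketch of one way to split a long transcript into smaller pieces at sentence boundaries. The function name `split_script` and the 400-word default are my own assumptions for illustration, not a documented limit of the model, and the sentence splitting is deliberately naive.

```python
import re


def split_script(text: str, max_words: int = 400) -> list[str]:
    """Group sentences into chunks that stay within max_words where possible.

    Hypothetical helper: max_words is an arbitrary budget, not a model limit.
    A single sentence longer than max_words is kept whole as its own chunk.
    """
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if n == 0:
            continue  # skip empty fragments (e.g. empty input)
        # Close the current chunk if this sentence would exceed the budget.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk could then be sent as a separate generation request and the resulting audio files joined afterwards. I have not verified whether this avoids the degradation described above.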
My main questions are:
Are these issues specific to AI Studio, or should I expect similar behavior when using the Gemini TTS model through the Google Cloud API (Vertex AI / Gemini API)?
When can we expect a stable solution?
If anyone has experience generating longer narration audio with gemini-tts-2.5-pro, I would appreciate any guidance or best practices that help avoid these problems.
Thank you.