Inconsistent Audio Output with Gemini 2.5 Pro Preview TTS

Is there a way to maintain consistency in Gemini speech generation?
I’ve been testing text-to-speech with Gemini 2.5 Pro Preview TTS for a couple of weeks, and even though I always use the same voice (“zephyr”) and the same prompt (“Read aloud in a warm and friendly tone”), almost every generation sounds different. Occasionally two takes match, but that’s not usually the case.
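For reference, this is roughly the request shape I’m sending. The field names here are my reading of the current speech-generation docs (a sketch, not authoritative; double-check the official API reference before relying on them):

```python
# Sketch of a Gemini TTS request body. Field names follow my reading of
# the current REST docs and may differ from the official reference.
payload = {
    "contents": [
        {"parts": [{"text": "Read aloud in a warm and friendly tone: Hello there!"}]}
    ],
    "generationConfig": {
        "responseModalities": ["AUDIO"],
        "speechConfig": {
            "voiceConfig": {
                "prebuiltVoiceConfig": {"voiceName": "Zephyr"}
            }
        },
    },
}
```

The voice and the style instruction are both pinned here, yet successive calls with this exact payload still come back sounding different.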

7 Likes

Hi @Fabi_Rodrigues
When you say the output sounds different, could you describe the variations you’re hearing? For example, is it a change in pitch, cadence, emotion, or something else entirely?
And also, are you trying to do this in AI Studio or via the API?

1 Like

Hi @Pannaga_J
That’s right. What really changes is the delivery: the tone, the rhythm, the emotion. Sometimes the variation is so strong it feels like an entirely different voice. Accents and intonation shift; one delivery might sound rushed while another feels very measured, or come across as higher or deeper. And all of this happens without my changing any of the generation settings. (This is in AI Studio.)

2 Likes

I am also seeing inconsistent accents for speakers.

1 Like

It’s good not to be alone in this :sweat_smile: .
I’ve tried several options, and it definitely works better with fewer than 600 tokens. However, the variations in accent, emotion, rhythm, and speed remain. I still haven’t found a way to make the model consistent, unlike with ElevenLabs, where the voice always sounds the same once the parameters are set.

1 Like

Based on my observations, an accent change appears to occur approximately once every five generations with Zephyr.
Can you please clarify if the accent inconsistencies are only happening with the Zephyr voice, or if you’ve noticed them with other voices as well?

Hello @Pannaga_J

Thank you for replying. It’s not just the accent, but also the emotion and the speed. The change happens with practically every generation; in other words, it’s very difficult to find two generations that sound the same. The same thing happens with Charon, Orus, Algenib, and Sadachbia. I do almost all of my tests in Spanish.

Hi @Fabi_Rodrigues
I previously tested in English and will now include Spanish in my testing. Anyway, thank you for raising this point. I will inform the team about these inconsistencies.

1 Like

I too am finding the voices extremely inconsistent across the board; it sounds like a different person each time you generate. Accents are especially bad: the model really struggles to hold a British accent, for example.

3 Likes

The question is: when are we going to get an update to the model so that it’s consistent?

3 Likes

Hi @Pannaga_J, any progress since you raised this with the team? Do we have any updates or a solution for the inconsistency issue?

I’m noticing this too. I’m using Enceladus in the API. I split a chapter in an audiobook (because the whole chapter got a “fetch failed sending request”) and the two halves are very different. The first half is more rumbling and aggressive while the second half is more gentle. The sad thing is that in a retry of the second half, I got the “fetch failed” error again.
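For the intermittent “fetch failed” errors, I’ve had decent luck wrapping the call in a simple backoff loop. A sketch (`generate_chunk` is just a stand-in for whatever function makes the actual TTS request; it’s not a real API name):

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=2.0):
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Back off 2s, 4s, 8s, ... plus a little jitter.
            time.sleep(base_delay * (2 ** attempt + random.random()))

# Usage (generate_chunk is hypothetical):
# audio = with_retries(lambda: generate_chunk(text))
```

It doesn’t fix the consistency problem, but it does keep a transient failure from killing a long batch run.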

I’ve pretty much accepted that I’ll need to do multiple takes of each chunk to get exactly what I want. And at what seems to be about $2 per generated hour ($20 per million output tokens at 32 tokens per second), it’s only going to cost $50 or so to do 3–4 takes on every chunk of a 100,000-word book. But the only two services I’ve found that do a good enough read both suffer from the same lack of consistency.
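The back-of-the-envelope math, for anyone curious (the 150 words-per-minute speaking rate is my assumption, not something I’ve measured for these voices):

```python
# Cost per generated hour, from the pricing above.
price_per_million_tokens = 20.0  # $20 per million output tokens
tokens_per_second = 32

tokens_per_hour = tokens_per_second * 3600            # 115,200 tokens
cost_per_hour = tokens_per_hour / 1_000_000 * price_per_million_tokens  # ~$2.30

# Rough book estimate; 150 wpm is an assumed speaking rate.
words = 100_000
words_per_minute = 150
hours_of_audio = words / words_per_minute / 60         # ~11.1 hours
cost_per_full_take = cost_per_hour * hours_of_audio    # ~$25.60 per pass
```

So each full pass over the book is around $25 at that speaking rate, and the total scales with however many takes you end up needing.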

I mean, it’s only costing me about $0.20 per generation and I’ve got plenty of time to use up that massive introductory credit. It’s just that my time isn’t free. Waiting on and reviewing multiple generations to find two that are both good and consistent is a pain.

2 Likes

@Fabi_Rodrigues @Pannaga_J

Hello Everyone…

I’m experiencing the same issue with Gemini TTS. Almost every new generation sounds noticeably different in terms of voice, rhythm, speed, and emotion — to the point where it doesn’t even feel like the same person is speaking.

Even though I’m using the exact same voice and TTS instructions, the consistency breaks as soon as the script changes.

For example, I’m generating long-form content (summaries around 20k words). I split the text into ~10 chunks (about 2k words each) and generate TTS for each chunk separately with the same settings. But when I merge them, each chunk sounds quite different — as if different voices or speaking styles are stitched together.

This makes it very hard to produce smooth, continuous audio.

Please let me know if any of you have found a good fix.

3 Likes

I have the same problem. It makes it impossible to work with. I always have to regenerate parts of longer texts because of misread words, and the newly generated parts sound completely different. Very frustrating, so I guess I have to stick with ElevenLabs.

1 Like

Hi @mr.mujheri,

For now, I think the best strategy is to work with fewer than 600–700 tokens. The variations in accent, emotion, rhythm, and speed remain, but by using fewer tokens you’ll be able to generate audio faster and keep the ones that are actually useful. Sad, but that’s the way it is. :broken_heart:
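If it helps, this is the kind of splitter I use to keep each request under the budget. It’s a sketch: it approximates token counts as words × 1.3, which is a rough heuristic, not the model’s actual tokenizer, and it breaks at sentence boundaries so no chunk starts mid-sentence:

```python
import re

def split_by_budget(text, max_tokens=600, tokens_per_word=1.3):
    """Split text into chunks at sentence boundaries, each under max_tokens.

    Token counts are estimated as words * tokens_per_word -- a rough
    heuristic, not the model's real tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_tokens = [], [], 0.0
    for sentence in sentences:
        est = len(sentence.split()) * tokens_per_word
        if current and current_tokens + est > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0.0
        current.append(sentence)
        current_tokens += est
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You still get the accent and emotion drift between chunks, but at least each individual take is quick to regenerate and audition.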

Yes, it’s somewhat frustrating. Every generation with the same prompt results in a different audio output. I don’t mind variation in the expression, but the lack of stability in voice tone, EQ, and audio quality makes short voice generation useless, because it cannot maintain consistency. It’s not worth using in my kind of project, which requires consistency in every generation.

Same problem here. The multi-speaker mode is a disaster. No generation sounds like another, not in the accent nor in the tone of the voices. There’s no consistency in the voices, the accent, or the intonation… nothing, every generation is different.