Inconsistent Audio Output with Gemini 2.5 Pro Preview TTS

Is there a way to maintain consistency in Gemini speech generation?
I’ve been testing text-to-speech with Gemini 2.5 Pro Preview TTS for a couple of weeks, and even though I always use the same voice (“zephyr”) and the same prompt (“Read aloud in a warm and friendly tone”), almost every generation sounds different. Occasionally two takes match, but that’s not usually the case.
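For reference, this is roughly the request shape I’m sending. The field names here are my reading of the current speech-generation docs (a sketch, not authoritative; double-check the official API reference before relying on them):

```python
# Sketch of a Gemini TTS request body. Field names follow my reading of
# the current REST docs and may differ from the official reference.
payload = {
    "contents": [
        {"parts": [{"text": "Read aloud in a warm and friendly tone: Hello there!"}]}
    ],
    "generationConfig": {
        "responseModalities": ["AUDIO"],
        "speechConfig": {
            "voiceConfig": {
                "prebuiltVoiceConfig": {"voiceName": "Zephyr"}
            }
        },
    },
}
```

The voice and the style instruction are both pinned here, yet successive calls with this exact payload still come back sounding different.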

7 Likes

Hi @Fabi_Rodrigues
When you say the output sounds different, could you describe the variations you’re hearing? For example, is it a change in pitch, cadence, emotion, or something else entirely?
And also, are you trying to do this in AI Studio or via the API?

1 Like

Hi @Pannaga_J
That’s right. What really changes is the delivery: the tone, the rhythm, the emotion. Sometimes the variation is so strong it feels like an entirely different voice. Accents and intonation shift; one delivery might sound rushed while another feels very measured, or come across as higher or deeper. And all of this happens without my changing any of the generation settings. (This is in AI Studio.)

2 Likes

I am also seeing inconsistent accents for speakers.

1 Like

It’s good not to be alone in this :sweat_smile: .
I’ve tried several options, and it definitely works better with fewer than 600 tokens. However, the variations in accent, emotion, rhythm, and speed remain. I still haven’t found a way to make the model consistent, unlike with ElevenLabs, where the voice always sounds the same once the parameters are set.

1 Like

Based on my observations, an accent change appears to occur approximately once every five generations with Zephyr.
Can you please clarify if the accent inconsistencies are only happening with the Zephyr voice, or if you’ve noticed them with other voices as well?

Hello @Pannaga_J

Thank you for replying. It’s not just the accent, but also the emotion and the speed. The change happens with practically every generation; in other words, it’s very difficult to find two generations that sound the same. The same thing happens with Charon, Orus, Algenib, and Sadachbia. I do almost all of my tests in Spanish.

Hi @Fabi_Rodrigues
I previously tested in English and will now include Spanish in my testing. Anyway, thank you for raising this point. I will inform the team about these inconsistencies.

1 Like

I too am finding the voices extremely inconsistent across the board; it sounds like a different person each time you generate. Accents are especially bad: the model really struggles to hold a British accent, for example.

3 Likes

The question is: when are we going to get an update to the model so that it’s consistent?

3 Likes

Hi @Pannaga_J, any progress since you raised this with the team? Do we have any updates or a solution for the inconsistency issue?

I’m noticing this too. I’m using Enceladus in the API. I split a chapter in an audiobook (because the whole chapter got a “fetch failed sending request”) and the two halves are very different. The first half is more rumbling and aggressive while the second half is more gentle. The sad thing is that in a retry of the second half, I got the “fetch failed” error again.
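For the intermittent “fetch failed” errors, I’ve had decent luck wrapping the call in a simple backoff loop. A sketch (`generate_chunk` is just a stand-in for whatever function makes the actual TTS request; it’s not a real API name):

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=2.0):
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            # Back off 2s, 4s, 8s, ... plus a little jitter.
            time.sleep(base_delay * (2 ** attempt + random.random()))

# Usage (generate_chunk is hypothetical):
# audio = with_retries(lambda: generate_chunk(text))
```

It doesn’t fix the consistency problem, but it does keep a transient failure from killing a long batch run.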

I’ve pretty much accepted that I’ll need to do multiple takes of each chunk to get exactly what I want. And at what seems to be about $2 per generated hour ($20 per million output tokens at 32 tokens per second), it’s only going to cost $50 or so to do 3–4 takes on every chunk of a 100,000-word book. But the only two services I’ve found that do a good enough read both suffer from the same lack of consistency.
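The back-of-the-envelope math, for anyone curious (the 150 words-per-minute speaking rate is my assumption, not something I’ve measured for these voices):

```python
# Cost per generated hour, from the pricing above.
price_per_million_tokens = 20.0  # $20 per million output tokens
tokens_per_second = 32

tokens_per_hour = tokens_per_second * 3600            # 115,200 tokens
cost_per_hour = tokens_per_hour / 1_000_000 * price_per_million_tokens  # ~$2.30

# Rough book estimate; 150 wpm is an assumed speaking rate.
words = 100_000
words_per_minute = 150
hours_of_audio = words / words_per_minute / 60         # ~11.1 hours
cost_per_full_take = cost_per_hour * hours_of_audio    # ~$25.60 per pass
```

So each full pass over the book is around $25 at that speaking rate, and the total scales with however many takes you end up needing.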

I mean, it’s only costing me about $0.20 per generation and I’ve got plenty of time to use up that massive introductory credit. It’s just that my time isn’t free. Waiting on and reviewing multiple generations to find two that are both good and consistent is a pain.

2 Likes

@Fabi_Rodrigues @Pannaga_J

Hello Everyone…

I’m experiencing the same issue with Gemini TTS. Almost every new generation sounds noticeably different in terms of voice, rhythm, speed, and emotion — to the point where it doesn’t even feel like the same person is speaking.

Even though I’m using the exact same voice and TTS instructions, the consistency breaks as soon as the script changes.

For example, I’m generating long-form content (summaries around 20k words). I split the text into ~10 chunks (about 2k words each) and generate TTS for each chunk separately with the same settings. But when I merge them, each chunk sounds quite different — as if different voices or speaking styles are stitched together.

This makes it very hard to produce smooth, continuous audio.

Please let me know if any of you have found a good fix.

3 Likes

I have the same problem. It makes it impossible to work with. I always have to regenerate parts of longer texts because of misread words, and the newly generated parts sound completely different. Very frustrating, so I guess I have to stick with ElevenLabs.

1 Like

Hi @mr.mujheri,

For now, I think the best strategy is to work with fewer than 600–700 tokens. The variations in accent, emotion, rhythm, and speed remain, but by using fewer tokens you’ll be able to generate audio faster and keep the ones that are actually useful. Sad, but that’s the way it is. :broken_heart:
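If it helps, this is the kind of splitter I use to keep each request under the budget. It’s a sketch: it approximates token counts as words × 1.3, which is a rough heuristic, not the model’s actual tokenizer, and it breaks at sentence boundaries so no chunk starts mid-sentence:

```python
import re

def split_by_budget(text, max_tokens=600, tokens_per_word=1.3):
    """Split text into chunks at sentence boundaries, each under max_tokens.

    Token counts are estimated as words * tokens_per_word -- a rough
    heuristic, not the model's real tokenizer.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_tokens = [], [], 0.0
    for sentence in sentences:
        est = len(sentence.split()) * tokens_per_word
        if current and current_tokens + est > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0.0
        current.append(sentence)
        current_tokens += est
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You still get the accent and emotion drift between chunks, but at least each individual take is quick to regenerate and audition.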

Yes, it’s somewhat frustrating. Every generation with the same prompt results in a different audio output. I don’t mind variation in the expression, but the lack of stability in voice tone, EQ, and audio quality makes short voice generation useless, because it cannot maintain consistency. It’s not worth using in my kind of project, which requires consistency in every generation.

Same problem here. The multi-speaker mode is a disaster. No generation sounds like another, not in the accent nor in the tone of the voices. There’s no consistency in the voices, the accent, or the intonation… nothing, every generation is different.