Gemini TTS ignores per-speaker voice settings in multi-character prompts

Anyone else having this issue?
I’ve set different voices for each speaker in a multi-character script using Gemini TTS, but the model only uses one voice for the entire output!
Speaker names are clearly defined, and the settings are configured properly, yet no voice switching happens. I’ve noticed that this happens only when the script for the TTS is long. It had no problem with voice switching when it had to say 2-3 sentences.

This really limits the use of Gemini!

Would appreciate any fixes, workarounds, or confirmation from the team if this is a known bug.

Thanks!

I am also seeing this issue when using the API, and it doesn’t even need to be a long script. The tone and gender of the voice also don’t seem to be respected in the API, compared with the samples available in AI Studio.

Yes, I am having the same issue in the API. The UI works, but the code copied from Google AI Studio does not.
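
For anyone comparing against the copied code, here is a minimal sketch of the multi-speaker part of a request body. It assumes the REST field names `multiSpeakerVoiceConfig`, `speakerVoiceConfigs`, and `prebuiltVoiceConfig` as I understand them from the Gemini API docs, and it assumes a two-speaker setup; `build_multi_speaker_request` is a hypothetical helper, not part of any SDK. The key point is that each `speaker` value has to match the name used in the script exactly. Please check the field names against the current documentation before relying on this.

```python
import json


def build_multi_speaker_request(script, voices):
    """Build a request body for two-speaker TTS (hypothetical helper).

    `voices` maps each speaker name, exactly as written in the script,
    to a prebuilt voice name.
    """
    if len(voices) != 2:
        raise ValueError("this sketch assumes exactly two speakers")
    # Each configured speaker should appear in the script as "NAME:".
    missing = [name for name in voices if f"{name}:" not in script]
    if missing:
        raise ValueError(f"speakers not found in script: {missing}")
    return {
        "contents": [{"parts": [{"text": script}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "multiSpeakerVoiceConfig": {
                    "speakerVoiceConfigs": [
                        {
                            "speaker": name,
                            "voiceConfig": {
                                "prebuiltVoiceConfig": {"voiceName": voice}
                            },
                        }
                        for name, voice in voices.items()
                    ]
                }
            },
        },
    }


body = build_multi_speaker_request(
    "TYRONE:\nYou got the whip?\n\nJAY:\nYeah G, outside.",
    {"TYRONE": "Sadachbia", "JAY": "Sadaltager"},
)
print(json.dumps(body, indent=2))
```

If the speaker names in the config and the script differ (for example in case or spacing), that mismatch is worth ruling out before treating single-voice output as a model bug.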

Have you tried downloading the audio? After downloading, the audio file is broken.
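
If the broken file comes from the API rather than the AI Studio download button, one possible cause (an assumption, not confirmed in this thread) is that the response audio is raw PCM without a WAV header. A minimal sketch of wrapping raw PCM in a WAV container with Python's standard `wave` module follows. The defaults of 24 kHz, mono, 16-bit are what I understand the TTS output format to be and should be verified; `save_pcm_as_wav` is a hypothetical helper name.

```python
import os
import tempfile
import wave


def save_pcm_as_wav(path, pcm_bytes, rate=24000, channels=1, sample_width=2):
    """Write raw PCM bytes to `path` with a WAV header (hypothetical helper).

    The default rate, channel count, and sample width are assumptions
    about the TTS output format.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)


# Demo with one second of silence standing in for real audio bytes.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
save_pcm_as_wav(demo_path, b"\x00\x00" * 24000)
print("wrote", demo_path)
```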

Hi @jokesonyou @Djalma_Araujo @felipem
I’ve successfully reproduced the scenario on my end, and it’s now working as expected in both AI Studio and with the copied code. Voice switching is functioning perfectly. Please re-check and let me know if you are still experiencing the issue.
Thank you

It’s partially fixed. I’ve regenerated a couple of times: sometimes it gets it right, and sometimes half the audio switches voices correctly and the other half is wrong. Maybe you want to try recreating the issue? Here’s the prompt directly from the right side of the AI Studio interface:

Use strong British roadman slang, drop your T's, hit them glottal stops — man ain't sayin' "button," it's "bu'on." Emphasise key slang words like fam, bruv, ting, Ps, g, bally, crud. Deliver bars like "Run the till or man'll mek it long" and "Don't be a hero, this ain't Avengers" with menace, like man's really on smoke. Add quick scoffs or laughs after jokes (e.g. "Loool," "Dead," or short chuckles). Use natural London street energy, bounce, and swagger. Slight pauses before punchlines. Speak like you're in a music video or after a cheeky robbery gone smooth. Keep it greazy, keep it raw.
Ready? You're TYRONE and JAY. Let's go:

TYRONE:
OG, man can't even cap, that was bare peak still. You clutched the cashier's face when man whipped out the tings.

JAY:
Swear down man, man looked like he seen a jinn fam. I hit him with the "run the till or man'll make it long."

TYRONE:
Blood started trembling like man's wearing flip-flops in December. Dead. Man folded quicker than a Primark T after one wash. And that little yute behind the till, trying to press panic like it's GTA. Man thinks he's in a Marvel thing.

JAY:
I told him, "don't get gassed fam, man ain't no superhero. This is real block biz." Bro froze like XP with 10 tabs open.

TYRONE:
Nah, you looked on crud though. Bali tight like man was born in it. Even mumsy wouldn't have clocked you, no lie.

JAY:
Pattern Bifferent, G. Can't get caught lacking. Don't want man's face on the feds' TikTok special.

TYRONE:
Come on. You grabbed the bread, yeah?

JAY:
Yeah G, man's got the Ps patterned. Nothing wild, just some light bread. Nando's and a clean tech fleece. Man's good.

TYRONE:
Say less. Next one we hit something proper. No more dusty corner spots. We going uptown, posh cafe vibes. Bear muffins, bear dough, clean getaway.

JAY:
Man moving mad. I ain't trying to get bagged over a croissant fam. Keep it lowkey, ends only. No extra noise.

TYRONE:
Real talk. Gotta move like mist. Not on them front pages. You got the whip?

JAY:
Yeah G, outside. Engine purring like a cat on loud mode. Let's dip, feds might start sniffing.

TYRONE:
Let's cut, fam. Mission done! Real life GTA but with no cheat codes.

JAY:
Come on man. Levels! Whole 'nother tier.

I’ve tested the prompt you shared on my end, and voice switching is functioning correctly.

Could you let me know which voices you used? I tested with Zephyr and Puck, which worked fine.
From the screenshot, I see you used Sadachbia and Zubenelgenubi. Are you still using those?

I switched to Sadachbia and Sadaltager. Regenerate it 2-3 times and you will start noticing the voices blending into one another and failing to switch. I can see the improvement, though. It doesn’t always fail.

This issue persists in AI Studio when generating multi-character TTS. It appears to be intermittent.
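
Since the original post reports that short scripts of 2-3 sentences switched voices correctly, one possible workaround is to split a long script into small groups of turns and synthesize each group separately. This is an untested idea, not a confirmed fix, and another reply above says the problem can also occur with short scripts. The sketch below only does the text splitting. It assumes turns are written as an uppercase `NAME:` line followed by the spoken text, as in the prompt shared earlier; `split_turns` and `chunk_turns` are hypothetical helper names.

```python
import re

# A speaker label is an uppercase name on its own line, ending in a colon.
TURN_RE = re.compile(r"^([A-Z][A-Z0-9_ ]*):\s*$")


def split_turns(script):
    """Return a list of (speaker, text) pairs; text before the first label is skipped."""
    turns = []
    speaker, lines = None, []
    for raw in script.splitlines():
        line = raw.strip()
        match = TURN_RE.match(line)
        if match:
            if speaker is not None:
                turns.append((speaker, " ".join(lines)))
            speaker, lines = match.group(1), []
        elif speaker is not None and line:
            lines.append(line)
    if speaker is not None:
        turns.append((speaker, " ".join(lines)))
    return turns


def chunk_turns(turns, max_turns=4, preamble=""):
    """Group turns into scripts of at most `max_turns` turns each.

    `preamble` (for example, style instructions) is repeated at the top
    of every chunk.
    """
    chunks = []
    for i in range(0, len(turns), max_turns):
        group = turns[i:i + max_turns]
        body = "\n\n".join(f"{name}:\n{text}" for name, text in group)
        chunks.append(f"{preamble}\n\n{body}" if preamble else body)
    return chunks


script = "Style note.\nTYRONE:\nHi.\n\nJAY:\nYo.\n\nTYRONE:\nBye."
for chunk in chunk_turns(split_turns(script), max_turns=2):
    print(chunk)
    print("---")
```

Each chunk would then be sent as its own request and the resulting audio concatenated. Delivery and tone may drift between chunks, so this trades consistency for more reliable switching, if it helps at all.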