I’m having an issue with the Gemini 2.5 Pro Preview TTS model. When I send the same API request repeatedly, with identical text, selected voice (user name), and temperature, the generated audio sometimes changes in tone, and occasionally the voice itself sounds slightly different too.
Hi @Anbu_Studioz,
Thank you for bringing this to our attention.
Could you please share the full payload details along with a sample of the code that you are using? We would like to reproduce the issue.
Hi @Shivam_Singh2,
I am experiencing this exact same issue with the Gemini TTS endpoints. In our application, when we send technical text inputs, the audio style destabilizes mid-output, causing severe pitch distortion, tone dropping, and an unexpected gender flip (the male voice profile shifts completely into a female voice register).
Per Google’s official speech generation limitations documentation, this seems to be a variation of the known “Voice inconsistency with prompt instructions” bug. We’ve tested both prompt optimization and application-side text-chunking (sentence slicing), but the stateless nature of the requests causes the voice profile to randomly re-initialize its acoustic parameters across sequential chunks, creating a highly disjointed “multiverse of voices” effect.
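For reference, application-side sentence slicing of the kind described above can be sketched roughly as follows. This is a minimal illustration rather than production code: the helper names `split_sentences` and `build_chunk_payloads`, the shortened `STYLE_PREAMBLE`, and the regex-based splitting rule are all assumptions made for the example. It shows why each chunk is independent: every payload carries the same voice name and instructions, but nothing links one request to the next.

```python
import re

# Hypothetical, shortened instruction prefix; the real prompt is longer.
STYLE_PREAMBLE = "Speak the following text naturally as speech.\n\nText to speak:\n"


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_chunk_payloads(text, voice_name="Puck",
                         model="gemini-2.5-flash-preview-tts"):
    """Build one stateless request payload per sentence.

    Every payload repeats the same instructions and voice name, since no
    state is shared between requests.
    """
    payloads = []
    for sentence in split_sentences(text):
        payloads.append({
            "model": model,
            "contents": STYLE_PREAMBLE + sentence,
            "config": {
                "response_modalities": ["AUDIO"],
                "speech_config": {
                    "voice_config": {
                        "prebuilt_voice_config": {"voice_name": voice_name}
                    }
                },
            },
        })
    return payloads
```

Each returned dictionary would be sent as a separate request, which is where the voice drift between chunks shows up.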
Below are our exact system details, code implementation, and reconstructed API payload for replication.
Environment & Configuration Details
- Target Model Endpoint: gemini-2.5-flash-preview-tts
- Voice Preset Profile: Puck (Male)
- Generation Parameters: Default system values (no explicit temperature, top_p, or top_k are defined in the config)
Reconstructed E2E API Request Payload
{
  "model": "gemini-2.5-flash-preview-tts",
  "contents": "Speak the following text naturally as speech. Follow these guidelines:\n- Language: English\n- For multilingual text (mixing English with Hindi, Punjabi, Tamil...), pronounce each word in its native language naturally\n- Ignore and skip over special characters like quotes, asterisks, hashtags...\n- Convert numbers to their word equivalents\n- Maintain natural pauses at commas and periods\n- Use appropriate intonation and emotion based on context\n\nText to speak:\nCan you describe a time when you used boundary value analysis in manual testing, and explain how it helped you identify defects or improve test coverage?",
  "config": {
    "response_modalities": ["AUDIO"],
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "Puck"
        }
      }
    }
  }
}
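To help with reproduction, here is a hedged sketch of how the payload above could be sent several times with the google-genai Python SDK so the resulting files can be compared by ear. Several things here are assumptions rather than details from our application: the helper names `pcm_to_wav_bytes`, `synthesize`, and `run_repro`, the output file names, the API key being read from the environment by `genai.Client()`, and the audio format (raw 16-bit mono PCM at 24 kHz, as used in Google's published TTS example).

```python
import io
import wave

MODEL = "gemini-2.5-flash-preview-tts"


def pcm_to_wav_bytes(pcm, rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM in a WAV container (assumes 16-bit mono at 24 kHz)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def synthesize(text, voice_name="Puck"):
    """Send one TTS request and return the raw audio bytes."""
    # Imported lazily so the helpers above work without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # assumption: API key is set in the environment
    response = client.models.generate_content(
        model=MODEL,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice_name,
                    )
                )
            ),
        ),
    )
    return response.candidates[0].content.parts[0].inline_data.data


def run_repro(text, runs=3):
    """Issue the identical request several times and save each result."""
    for i in range(runs):
        with open(f"puck_run_{i}.wav", "wb") as f:
            f.write(pcm_to_wav_bytes(synthesize(text)))
```

Calling `run_repro(contents)` with the full `contents` string from the payload writes `puck_run_0.wav` through `puck_run_2.wav`; any difference in tone or register between those files comes from identical inputs.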
Any insights on how to enforce voice consistency or stabilize the speaker profile across long-form/technical token payloads on the preview tier would be highly appreciated. Thank you!