Gemini 2.5 Pro Preview TTS: Inconsistent Voice and Tone Output

I’m having an issue with the Gemini 2.5 Pro Preview TTS model. When I repeat a single API request with the same text, selected voice (user name), and temperature, the generated audio sometimes changes in tone, and occasionally the voice sounds slightly different too.

Hi @Anbu_Studioz,

Thank you for bringing this to our attention.
Could you please share the full payload details along with a sample of the code that you are using? We would like to reproduce the issue.

Hi @Shivam_Singh2,

I am experiencing this exact same issue with the Gemini TTS endpoints. In our application, when we send technical text inputs, the audio style processor destabilizes mid-output, causing severe pitch distortion, tone dropping, and an unexpected gender flip (the male voice profile completely mutates into a female voice register).

Per Google’s official speech generation limitations documentation, this seems to be a variation of the known “Voice inconsistency with prompt instructions” bug. We’ve tested both prompt optimization and application-side text-chunking (sentence slicing), but the stateless nature of the requests causes the voice profile to randomly re-initialize its acoustic parameters across sequential chunks, creating a highly disjointed “multiverse of voices” effect.
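For reference, the sentence slicing we tried looks roughly like the sketch below. This is a simplified illustration, not our production code: the function name split_into_chunks and the max_chars value are placeholders, and the regex split is naive (abbreviations such as "e.g." will be over-split).

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_into_chunks("One. Two! Three? Four.", max_chars=10):
        print(repr(chunk))
```

Each chunk is then sent as its own stateless request, which is exactly where the voice drifts between chunks.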

Below are our exact system details, code implementation, and reconstructed API payload for replication.

Environment & Configuration Details

  • Target Model Endpoint: gemini-2.5-flash-preview-tts

  • Voice Preset Profile: Puck (Male)

  • Generation Parameters: Default system values (No explicit temperature, top_p, or top_k are defined in the config)

Reconstructed E2E API Request Payload

{
  "model": "gemini-2.5-flash-preview-tts",
  "contents": "Speak the following text naturally as speech. Follow these guidelines:\n- Language: English\n- For multilingual text (mixing English with Hindi, Punjabi, Tamil...), pronounce each word in its native language naturally\n- Ignore and skip over special characters like quotes, asterisks, hashtags...\n- Convert numbers to their word equivalents\n- Maintain natural pauses at commas and periods\n- Use appropriate intonation and emotion based on context\n\nText to speak:\nCan you describe a time when you used boundary value analysis in manual testing, and explain how it helped you identify defects or improve test coverage?",
  "config": {
    "response_modalities": ["AUDIO"],
    "speech_config": {
      "voice_config": {
        "prebuilt_voice_config": {
          "voice_name": "Puck"
        }
      }
    }
  }
}
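To make the reproduction easier, here is a small sketch of how the payload above could be built per request, so every call carries an identical voice configuration. The helper name build_tts_request is hypothetical, and the optional temperature and seed keys are an assumption on my part: our current config does not set them, and I have not confirmed that the preview TTS model honors either one.

```python
from typing import Any, Optional


def build_tts_request(
    text: str,
    voice_name: str = "Puck",
    model: str = "gemini-2.5-flash-preview-tts",
    temperature: Optional[float] = None,
    seed: Optional[int] = None,
) -> dict[str, Any]:
    """Return a request dict in the same shape as the payload posted above."""
    config: dict[str, Any] = {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {
                "prebuilt_voice_config": {"voice_name": voice_name}
            }
        },
    }
    # Assumption: pinning sampling parameters may reduce run-to-run variation.
    # Whether the TTS preview endpoint respects them is unverified.
    if temperature is not None:
        config["temperature"] = temperature
    if seed is not None:
        config["seed"] = seed
    return {"model": model, "contents": text, "config": config}


if __name__ == "__main__":
    request = build_tts_request("Text to speak: hello.", temperature=0.0, seed=42)
    print(request["config"])
```

Sending the same dict several times and comparing the audio by ear should show whether the drift also happens with pinned parameters or only with the defaults.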

Any insights on how to enforce voice consistency or stabilize the speaker profile across long-form/technical token payloads on the preview tier would be highly appreciated. Thank you!