Inconsistent Character Voice Output and Failure to Interpret Amharic Heteronyms Based on Context

The voice generation system produces inconsistent results across multiple generations. When generating audio for the same character, the second output often differs significantly from the first. This includes changes in pitch, tone, emotion, sound quality, and even the overall identity of the voice. Voice characteristics should remain stable across generations.

Additionally, the model does not correctly interpret Amharic heteronyms. Amharic contains many words that share the same spelling but differ in pronunciation and meaning; in the examples below, the readings differ by a doubled (geminated) consonant that the written form does not show. The model currently picks between the readings inconsistently, often seemingly at random, because it does not appear to use contextual cues during translation or text-to-speech processing.

Examples of affected Amharic heteronyms:

  • ገና

    • gena: “still / yet”

    • genna: “Christmas / day of glory”

  • ለጋ

    • lega: “fresh / young”

    • legga: “hit” (verb)

  • ዋና

    • wana: “swimming”

    • wanna: “main / core”

  • ሽፍታ

    • shifta: “rebel”

    • shiffta: “rash” (skin condition)

  • ከፋ

    • kefa: place name

    • keffa: “worse”

For comparison, English has heteronyms such as lead (verb) vs. lead (metal) or tear (cry) vs. tear (rip), where the correct reading depends entirely on context. The current system does not reliably apply this kind of contextual logic to the Amharic words above.
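
To make these cases easier to check by ear, the words listed above can be kept as a small fixture, so that each generated clip is reviewed against the known readings. This is only a sketch: the names `Reading`, `HETERONYMS`, `ambiguous_words`, and `checklist` are illustrative and not part of any API, and the plain substring match is a simplification that ignores Amharic prefixes and suffixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    romanization: str  # a doubled consonant marks the geminated reading
    gloss: str


# Only the examples from this report; not an exhaustive list.
HETERONYMS = {
    "ገና": (Reading("gena", "still / yet"),
           Reading("genna", "Christmas / day of glory")),
    "ለጋ": (Reading("lega", "fresh / young"),
           Reading("legga", "hit (verb)")),
    "ዋና": (Reading("wana", "swimming"),
           Reading("wanna", "main / core")),
    "ሽፍታ": (Reading("shifta", "rebel"),
            Reading("shiffta", "rash (skin condition)")),
    "ከፋ": (Reading("kefa", "place name"),
           Reading("keffa", "worse")),
}


def ambiguous_words(text):
    """Return the listed heteronyms found in `text` (naive substring match)."""
    return [word for word in HETERONYMS if word in text]


def checklist(text):
    """Build one review line per ambiguous word, listing its possible readings."""
    lines = []
    for word in ambiguous_words(text):
        options = " | ".join(
            f"{r.romanization} ({r.gloss})" for r in HETERONYMS[word]
        )
        lines.append(f"{word}: {options}")
    return lines
```

Calling `checklist(line)` on a test line gives a reviewer the readings to listen for in each regenerated clip.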

Impact:

  • Incorrect translations and voice outputs

  • Loss of meaning in Amharic text

  • Unreliable character consistency in audio

  • Overall degradation of user experience for Amharic speakers

Expected Behavior:

  1. Voice characteristics (pitch, tone, emotion, identity) should remain consistent across generations for the same character.

  2. The model should accurately interpret Amharic heteronyms based on context, similar to how it handles English heteronyms.

  3. Audio output should apply a stronger and more natural Amharic accent when generating Amharic speech.

Steps to Reproduce:

  1. Generate audio for any character using Amharic text.

  2. Regenerate the same line or paragraph using the same character.

  3. Compare the voice quality and interpretation of heteronyms between generations.
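
The steps above can be sketched as a short script. It is a hedged example, not the exact code behind this report: it assumes the Gemini API REST `generateContent` endpoint, an API key passed in by the caller, and 24 kHz 16-bit mono PCM output as in the public TTS documentation. The voice name `Kore` and the helper names (`build_payload`, `generate_pcm`, `save_wav`, `duration_seconds`, `run_repro`) are placeholders chosen for illustration.

```python
import base64
import json
import urllib.request
import wave

MODEL = "gemini-2.5-pro-preview-tts"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent"
)


def build_payload(text, voice_name):
    """Request body asking for audio output with a fixed prebuilt voice."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


def generate_pcm(text, voice_name, api_key):
    """Send one request and return the decoded audio bytes."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(text, voice_name)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    part = body["candidates"][0]["content"]["parts"][0]
    return base64.b64decode(part["inlineData"]["data"])


def save_wav(path, pcm, rate=24000):
    """Wrap raw PCM in a WAV container (assumed mono, 16-bit samples)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(pcm)


def duration_seconds(pcm, rate=24000):
    """Clip length, assuming 2 bytes per sample and one channel."""
    return len(pcm) / (2 * rate)


def run_repro(text, voice_name, api_key, takes=2):
    """Generate the same line several times and save each take for listening."""
    for i in range(1, takes + 1):
        pcm = generate_pcm(text, voice_name, api_key)
        save_wav(f"take_{i}.wav", pcm)
        print(f"take_{i}.wav: {duration_seconds(pcm):.2f} s")
```

Running `run_repro(line, "Kore", api_key)` with an Amharic line writes `take_1.wav` and `take_2.wav`, which can then be compared by ear for voice identity and for the reading chosen for each heteronym. Byte-level comparison is not meaningful here, since two takes can differ in their samples while still sounding like the same voice.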

Additional Request:
If possible, please enhance the Amharic accent realism and consistency in generated speech.

Hi @tifo_plus

Thank you for reaching out to us.
Could you please let us know which model you are using? Additionally, could you share a minimal code example that we can review on our end? This will help us better understand and analyze the issue so we can provide a more accurate and helpful response.

Hi @Shivam_Singh2,

Thanks for the response.

I’m using models/gemini-2.5-pro-preview-tts via the Gemini API for text-to-speech generation.

The issues occur consistently with this model:

  • Voice instability across regenerations for the same character (changes in pitch, tone, emotion, and perceived identity).

  • Incorrect contextual interpretation of Amharic heteronyms, where pronunciation and meaning vary randomly despite clear context (e.g., ገና, ለጋ, ዋና, etc.).

I can share a minimal code example if needed, but the issue is reproducible by simply regenerating the same Amharic text multiple times with the same settings.

Please let me know if there are recommended parameters, constraints, or upcoming fixes to:

  • enforce voice consistency, and

  • improve contextual disambiguation for Amharic in TTS.

Thanks for your help.

Hello,

Could you please share the full payload details along with a full sample of the code that you are using? We would like to reproduce the issue.