Why was the TTS model nerfed on December 10th?

It’s very clear that the updated TTS model is inferior to its predecessor on multiple fronts.

  1. Metallic noise in the generated audio files
  2. Expressivity and prompt adherence have degraded. The Pro model is also worse, yet somehow slower than before.
  3. Worst problem of all: voice identity and timbre have changed.

Anyone else facing the same issues?

Hi @Chuck, thanks for reaching out!

Could you please share which model you are using and, if possible, the exact prompt that produces the problems, so that we can analyze this further?

Yes, the voices now sound more robotic, fake and don’t change tone properly.
How do we revert to the older models?

I was doing a podcast and it sounded really good and natural. Now it sounds fake, like “AI slop”.

Alnilam is severely downgraded and changed. Achernar less so, but they both sound inadequate.
Worst is the lack of consistency not only between chunks (some faster, some slower…) but also within a single response.

Here are two examples. Both files contain snippets recorded before and after the nerf, using the flash models.

The first one is about the change of personality.

Notice how the lively, natural conversation is replaced by pretend suspense-building in the style of a children’s television show.

This one is more problematic:

  1. Alnilam changes voice completely within a single API response.
  2. Giant pause with awkward resolution.

The new voices generally sound fake and awkward. Before Dec 10 it was a dynamic, natural, easy-to-follow chat. Now it is like a neighborhood theater open-mic event. People used to ask me whether it was really AI because it sounded so good; now it is obviously fake, with a kind of amateurish pathos.

P.S. This is the prompt for the second file

SARAH
This episode releases on December 1st, and as the community knows, that means it’s World AIDS Day. Victor, what’s on your mind today as we mark this date?

VICTOR
(Pauses reflectively)
It’s a powerful day for memory, of course. For remembering the resilience we built out of necessity. But a day like this has to be about accountability, too. We’re digging into the ECDC’s 2025 HIV Report today—it’s a major community check-in.

SARAH
(Seriously)
And the news is… complicated. The report essentially warns us about a “Hidden Crisis” in Europe. Victor, the number that slapped me in the face was this: 54% of all new HIV diagnoses in 2024 were late.

VICTOR
Exactly. Over half of the people diagnosed are already immune-compromised. That’s why we’re breaking this data down today: to understand the gap between our amazing success in treatment and our massive failure in testing.

SARAH
We’ll simplify the numbers, talk about the barriers like stigma and PrEP access, and—most importantly—what we can practically do this week. Let’s dive in.

Hi,

Are you still facing this issue?

Yes, we are. This happens for ALL the models, for EVERY generation, for EVERY prompt.

Please do me a favor: deploy your older models internally and simply listen to how the audio has changed. It’s extremely obvious.

Did some tests today and actually recorded a full podcast episode to get a feel for how the model behaves now. For me it is way better!

The pro voices still sound mostly flu-sick and depressed, but flash is quite good. The experience will probably vary by voice; for example, I noticed Orus is now completely different.
Of the voices that I use:

  • Achernar (flash) is better than ever! Has the old liveliness back with a polish to sharpen the edges.
  • Alnilam (flash) is still somewhat unstable, though not as much as before, so in a dynamic podcast-style dialogue it is mostly OK. There are still inconsistencies: sometimes he sounds younger, sometimes older, and sometimes the melancholic, monotone performance from the past weeks comes back for a set of lines.
  • Sometimes background noise appears mid-chunk.

Mixing a bit of acting into the natural talk is OK in itself, but since the “AI director” operates only within the chunk’s context, it creates continuity problems: the start of a new chunk sounds radically different. I’m not sure what the balance here should be, as the added “acting” does fix some of the rough edges the voices had before December.
Maybe an option would be to send the full text (or the previous and next paragraphs) for context, along with an instruction specifying which part to actually voice.
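To make that idea concrete, here is a minimal sketch of how such prompts could be assembled before being sent to a TTS model. It only builds strings and makes no API call. The function name `build_chunk_prompts`, the `[SPEAK]` / `[CONTEXT BEFORE]` / `[CONTEXT AFTER]` markers, and the instruction wording are all my own assumptions, and I have not verified that the model actually honors an instruction like this.

```python
def build_chunk_prompts(paragraphs, context=1):
    """Build one prompt per paragraph, wrapping it with neighbouring paragraphs.

    Hypothetical helper: the markers and the instruction text are assumptions,
    not part of any documented TTS API.
    """
    instruction = (
        "Voice ONLY the text between [SPEAK] and [/SPEAK]. "
        "The surrounding text is context for tone and pacing; do not read it aloud."
    )
    prompts = []
    for i, target in enumerate(paragraphs):
        # Up to `context` paragraphs on each side of the one being voiced.
        before = paragraphs[max(0, i - context):i]
        after = paragraphs[i + 1:i + 1 + context]

        parts = [instruction]
        if before:
            parts.append("[CONTEXT BEFORE]\n" + "\n\n".join(before) + "\n[/CONTEXT BEFORE]")
        parts.append("[SPEAK]\n" + target + "\n[/SPEAK]")
        if after:
            parts.append("[CONTEXT AFTER]\n" + "\n\n".join(after) + "\n[/CONTEXT AFTER]")
        prompts.append("\n\n".join(parts))
    return prompts
```

Each returned string would then be sent as the text of one TTS request, so every chunk carries its neighbours as context while only the marked part is meant to be spoken.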

Here is a short clip:

Voices keep changing and mutating, not only between chunks but within a single chunk. A voice shifts from younger- to older-sounding, or shows similar variations, which creates a creepy uncanny-valley experience. So unfortunately the December nerf is still very much an issue.