Urgent Feedback & Call for Correction: A Serious Breach of Developer Trust and Stability (Update: Still silence from Google)

Google released a ‘developer’ update ahead of I/O, hoping developers could use it to build more, but it has actually left us unable to build.

Didn’t they test the model before release?

1 Like

While the newer version does follow instructions better, the March one was the first truly objective and critical model I’ve encountered.

It genuinely didn’t handhold; it gave critical, honest feedback, even when my assumptions were wrong, where all other models just parroted what they thought the user wanted to hear.

Google, please keep the old models around. Every time a model gets fine-tuned for the masses it gets worse. GPT-4 shortly after release was better than what we have now, and Claude 3.5 is still better than 3.7 in many cases. The same pattern keeps repeating: humans pick the feedback that makes them feel good, and in return you get an emoji-spewing, sycophantic AI.

3 Likes

@Michael_Six, while I’m trying to keep this thread laser-focused on Google’s policy for dated endpoint names, your observations on model performance are too important not to comment on.

I’m right there with you on 03-25. It was the first model that genuinely pushed back in a way that felt truly meaningful. I’ve been a heavy user of LLMs, personally and professionally, since OpenAI’s earliest public models back in 2022, and I’ve run billions of tokens through them, but I will never forget the first time a model made me question my own assumptions; that was 03-25. I pushed back, expecting the usual apology and course correction you get from every other model. It didn’t apologize. It doubled down, and it shocked me into realizing I was the one who was wrong. That kind of interaction is incredibly valuable, and something I hadn’t seen before.

And you’re 100% right about this whole A/B-testing and fine-tuning cycle. We’ve seen it AGAIN and AGAIN. The current chatgpt-4o-latest is, frankly, a mess compared to the base 4o model because of it. You’re not getting reinforcement from actual “experts” in reasoning, like OpenAI used for their ‘o’ series models; you’re getting reinforcement from whatever general users “liked” because it was agreeable and didn’t challenge them. I’d hate to see Gemini go down that same path. Fine-tuning, IMO, should almost never be for personality; let users handle that with a system message. Trying to fine-tune personality at this scale has massive, unforeseen side effects every single time.
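For what it’s worth, keeping the persona in a system instruction is trivial to do per request. Here’s a minimal sketch, assuming the google-genai Python SDK and an API key in the environment; the persona text is just an example, not anyone’s actual prompt:

```python
# Minimal sketch: put "personality" in a per-request system instruction rather
# than relying on fine-tuned default behaviour. Assumes the google-genai SDK
# and a GEMINI_API_KEY set in the environment; the persona text is illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-05-06",
    contents="Review this plan and tell me where it is wrong: ...",
    config=types.GenerateContentConfig(
        system_instruction=(
            "Be a blunt technical reviewer. Point out flawed assumptions directly, "
            "do not soften criticism, and never agree just to be agreeable."
        ),
        temperature=0.2,
    ),
)
print(response.text)
```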

On your point about instruction following, I have to somewhat disagree that 05-06 is better. My personal testing leans toward it being worse over longer contexts. Now, this is totally anecdotal, and I have zero eval data to back up that personal claim, so take it for what it’s worth, but I can point directly to the public benchmark for long fiction recall. That is, in my opinion, an excellent indicator of coherent understanding over extended contexts. The 03-25 model hitting 90% at 120k tokens was groundbreaking and a real industry first. I confirmed myself that it wasn’t just a benchmark score; it translated to real-world understanding at long context that was remarkably good, only really falling apart around 150k to 200k tokens.

If you check the chart, you’ll see gemini-2.5-pro-preview-03-25 and gemini-2.5-pro-preview-05-06 now show nearly identical performance. That’s because, as we know, the 03-25 preview endpoint got silently redirected to the 05-06 checkpoint. But as you can see, exp-03-25 still shows the original, much higher performance, because it is still the original model. I’ve personally verified this by checking the response object from the Gemini API; that specific experimental endpoint is NOT redirected and still points to the actual March 25 checkpoint.
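If you want to run the same check, here’s a minimal sketch of the idea, assuming the google-genai Python SDK; the model_version field on the response is what I looked at to see which checkpoint an endpoint actually resolves to (if your SDK version doesn’t expose that field, treat that part as an assumption):

```python
# Minimal sketch: ask each dated endpoint for a trivial completion and print
# which checkpoint the response says actually served it. Assumes the
# google-genai SDK and that the response exposes a model_version field.
from google import genai

client = genai.Client()  # reads the API key from the environment

for endpoint in ("gemini-2.5-pro-preview-03-25", "gemini-2.5-pro-exp-03-25"):
    response = client.models.generate_content(
        model=endpoint,
        contents="Reply with the single word: ok",
    )
    # If the endpoint has been silently redirected, model_version reports the
    # checkpoint that handled the request, not the name you asked for.
    print(endpoint, "->", getattr(response, "model_version", "<not reported>"))
```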

So the data clearly shows 05-06 is a significant regression in long-context understanding, dropping from an exceptional outlier to something more in line with current SOTA averages.

And to my earlier point, I believe understanding over long context translates directly to the ability to follow instructions over long context as well.

4 Likes

Apart from the long-context problems you’ve identified, we’re also seeing substantially longer response times, combined with an alarming tendency to ‘not think’, all of which leads to increased hallucinations and far worse overall results.

We do have internal evals, and they’re now consistently failing in lots of cases. This is a crying shame for both end users and other developers like us. They giveth with one hand while taking away with the other 🙂

2 Likes

A quick but important update: Logan Kilpatrick has announced a “temporary pause” on Gemini 2.5 Pro free tier API access, citing high demand.

This is significant because it effectively removes API access to gemini-2.5-pro-exp-03-25, which many of us confirmed was the actual, non-redirected March 25th checkpoint.

It’s hard not to connect this pause with the likely surge in usage on that specific exp-03-25 endpoint, once word spread that it was the true 03-25 model, especially after the main preview endpoint was redirected.

This continues to show the urgent need for Google to clarify its policy on dated endpoints. We still have no official word on their immutability or the conditions for redirection or removal, which continues to hurt developers’ trust and our ability to build reliably.

Hoping for clarity from Google soon.

6 Likes

Thank you @H_Express for this thread. I will keep this brief as I am not using the API at the moment, but I was planning to start building on it after witnessing the capabilities of Gemini Pro 2.5 (03-25).

For the past few weeks I have been using Gemini for machine learning research, and its ability to push back on ideas and provide genuine technical insight struck me as a major leap ahead – for me the second “uh-oh” moment, the first being with the original GPT-4. Gemini was acting as a competent collaborator in fairly technical fields (not just for coding). Since the 05-06 update, most of that is gone, as mentioned by @Michael_Six. The new model’s thought patterns and writing are not as sharp, and are sometimes just confused in a way reminiscent of lesser models, with glaring attentional mistakes that muddle together information from different parts of the context or its own thought chain, something 03-25 would not do.

I started looking around and found this thread (together with others). Hopefully there will be clarity about what is happening, and 03-25 will be made available again.

2 Likes

Dear @Govind_Keshari,
Are you able to provide any acknowledgement of this issue? Has this thread been attended to by the internal team?

Many Thanks

1 Like

This is very unfortunate. I spent a lot of time working on our use cases in the area of legal text comparison with gemini-2.5-pro-exp-03-25. The results were great. Now things are broken. I hope there’s a way to get back API access to 03-25.

2 Likes

Yes, 100%

For myself, and for many others, the quality of the 03-25 model is what drove us to adopt it over other models we’d been using.

Silently redirecting it to the FAR inferior 05-06 version, which requires significant rewriting of system prompts just to restore some of 03-25’s innate functionality (and still falls short), is beyond frustrating.

Please restore access to 03-25!

2 Likes

SHAME ON YOU, GOOGLE. You will only harm yourselves with these greedy practices.

Since Google (and @Logan_Kilpatrick) continue to ignore this thread and refuse to clarify anything related to their apparent new policy of redirecting clearly dated endpoints, several of us have been forced to investigate alternative solutions on our own time.

I wanted to share a quick update about what we’ve found that might help others whose applications were broken by Google’s backdoor model switch (now that even the exp endpoint has been disabled on the Gemini API per Logan’s announcement).

The good news: We’ve confirmed that the original March 25th checkpoint IS still accessible, but only through the Vertex AI API, not the Gemini API.

The response object from Vertex does include a model name identifier that seemed to confirm there is NO redirect in place, but instead of trusting it (since that is what this whole issue is about), I decided to test it myself to be absolutely certain.

Using a fixed seed and temperature=0 for output as deterministic as the API currently allows, we ran tests comparing the output “signatures” (content, logprobs, token counts) of the different dated endpoints on Vertex AI. Our findings:

  1. NOT Redirected on Vertex: On Vertex AI, the gemini-2.5-pro-preview-03-25 endpoint produces completely different output signatures than the gemini-2.5-pro-preview-05-06 endpoint. They are definitely NOT the same model.

  2. The gemini-2.5-pro-exp-03-25 endpoint on Vertex (when you can access it) produces output signatures IDENTICAL to the original March 25th model we all came to rely on. However, this endpoint appears severely rate-limited.

The result you, as a dev, care about: the preview-03-25 endpoint on Vertex is NOT redirected to 05-06 the way it is on the Gemini API. exp-03-25 is rate-limited but confirmed to be IDENTICAL in signature to preview-03-25. Therefore, developers who need the original March model can switch to Vertex AI as a temporary solution, confidently using preview-03-25 in place of the exp endpoint.
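If anyone wants to reproduce the comparison, here’s roughly the shape of what we ran. This is a minimal sketch assuming the vertexai Python SDK; the project ID, region, and prompt set are placeholders, the seed parameter being honored on these models is an assumption, and the logprob comparison from the real run is omitted for brevity:

```python
# Minimal repro sketch: compare output "signatures" of dated endpoints on
# Vertex AI with temperature=0 and a fixed seed. Project, region, prompts,
# and the seed being honored are assumptions; logprob comparison is omitted.
import hashlib

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

vertexai.init(project="your-gcp-project", location="us-central1")

PROMPTS = [
    "Explain the difference between a mutex and a semaphore in two sentences.",
    "Summarise the plot of Moby-Dick in exactly three bullet points.",
]

def signature(model_name: str) -> str:
    """Hash the concatenated outputs and token counts for a fixed prompt set."""
    model = GenerativeModel(model_name)
    parts = []
    for prompt in PROMPTS:
        resp = model.generate_content(
            prompt,
            generation_config=GenerationConfig(temperature=0, seed=42),
        )
        parts.append(resp.text)
        parts.append(str(resp.usage_metadata.candidates_token_count))
    return hashlib.sha256("\n".join(parts).encode()).hexdigest()

for name in (
    "gemini-2.5-pro-preview-03-25",
    "gemini-2.5-pro-preview-05-06",
    "gemini-2.5-pro-exp-03-25",
):
    print(name, signature(name))
```

Identical hashes mean identical signatures on that prompt set; different hashes mean you’re talking to different checkpoints.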

@wowitsjack, @Ed_Godshaw - if you switch your apps to Vertex AI instead of Gemini API (straightforward in most SDKs), this could be your path back to stability until Google either fixes the 05-06 model issues or (hopefully) provides an official way to access the original 03-25 model again.
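For what it’s worth, if you’re on the google-genai SDK the switch is mostly a one-line change. A minimal sketch, assuming that SDK and placeholder project/region values:

```python
# Minimal sketch of pointing an existing google-genai app at Vertex AI instead
# of the Gemini API. Project ID and region are placeholders.
from google import genai

# Before: Gemini API (preview endpoint redirected, exp access now paused)
# client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

# After: Vertex AI, where preview-03-25 still resolves to the March checkpoint
client = genai.Client(vertexai=True, project="your-gcp-project", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.5-pro-preview-03-25",
    contents="Same prompt your app already sends",
)
print(response.text)
```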

Is this ideal? Absolutely not. We shouldn’t have to do detective work to find out where Google hid a model that was working perfectly. The breach of trust regarding dated endpoints remains unaddressed, and Google’s continued silence on this thread speaks volumes.

But at least there’s a potential path forward for those whose applications are currently broken due to this sudden switch. I hope this helps at least one frustrated and confused developer out there.

5 Likes