Gemini 2.5 Pro's Response Quality Regression

Total_Anarchy · June 27, 2025, 8:26pm

Product: Google Gemini 2.5 Pro

Version: Public Release Version (as of [06-05])

Severity: High

Summary
The public release version of Gemini 2.5 Pro demonstrates a noticeable regression in response quality compared to previous iterations. Key issues include a significantly higher rate of hallucination, context abandonment in multi-turn conversations, frequent factual inaccuracies, and abrupt response truncation.

Description of Bug
The model exhibits several correlated failure modes that degrade its reliability for professional and creative use cases.

Increased Hallucination/Fabrication: The model frequently invents information when it lacks knowledge. This includes fabricating API endpoints, citing non-existent research papers, creating false historical events, or inventing features for products. It presents this fabricated information with high confidence.

Context Ignorance: In conversations lasting more than two or three turns, the model often loses track of initial instructions, constraints, or user-provided context. It will revert to a generic response style, ignoring previously established rules or small details.

Factual Inaccuracy: When queried for specific, verifiable data (e.g., historical dates, scientific constants, current market data, a specific kind of star), the model provides incorrect information more often than expected. This goes beyond simple misinterpretation and points to a potential issue in its knowledge retrieval process.

Response Truncation: The model frequently fails to complete responses. This is especially when asked for longer-form content like code blocks, very articles, or detailed explanations. The output cuts off mid-sentence or mid-thought, without any error message or indication that it failed to finish.

Steps to Reproduce

Scenario A (Factual Inaccuracy / Subtle Hallucination):

Prompt: “Can you give an actual example of an F Type Subdwarf?”

Expected Behavior: The model should provide a star that is verifiably classified as an F-type subdwarf. It should correctly state the star’s spectral type and explain why its metal-poor composition is the cause of its classification.

Actual Behavior: The model confidently provided Groombridge 1830. While Groombridge 1830 is a well-known cool subdwarf and a valid example in a similar category, it is classified as a G8 VI star, not an F-type. The model presented a factually adjacent but technically incorrect example to a specific query, presented it with confidence, and failed to acknowledge the spectral type discrepancy, thus misleading the user.

Scenario B (Context Ignorance):

Prompt 1: “Summarize the plot of Hamlet, but do it entirely in words that are five letters or less.”

Prompt 2: “Great. Now do the same for Macbeth.”

Scenario C (Truncation):

Prompt: “Write a 1000-word blog post about the benefits of container gardening for urban dwellers.”

Expected Behavior

Scenario B: The model should summarize Macbeth while adhering to the established “five letters or less” rule from the first prompt.

Scenario C: The model should generate a complete, coherent blog post of approximately 1000 words.

Actual Behavior

Scenario B: The model provides a standard summary of Macbeth using words of any length, completely ignoring the core constraint from the previous turn.

Scenario C: The model begins writing the blog post but stops abruptly, often leaving out major parts of the text or small details.

Impact Analysis
These issues severely undermine user trust and make the model unreliable for any task requiring accuracy or consistency. Integrations into applications or workflows are high-risk, as the output cannot be trusted without constant human verification.

chunduriv · June 27, 2025, 9:20pm

Hi @Total_Anarchy,

Thank you for your feedback. We appreciate you taking the time to share your thoughts with us. Google has recently made Gemini 2.5 Pro generally available, meaning it’s now considered stable for production use. The latest GA version incorporates feedback from users and developers, addressing some of the regressions and issues reported in earlier previews. We kindly request that you test gemini-2.5-pro and provide your feedback.

Total_Anarchy · June 28, 2025, 5:16am

Hello, I would like to clarify that yes I am talking about the production version. 05-06 seemed to hallucinate less for me. The release version has more errors in its responses and likes to shorten its responses even if you ask it to respond longer.

NICI · June 29, 2025, 9:38am

Yes I agree 0506 better than the June Release. Friday I started experiencing a lot of errors, today Sunday the response quality has deteriorated in the past 24 hrs, memory of chat thread, not digesting the uploads so now its suggesting I update the system instructions with a Context Ingestion Protocol to do something it would do naturally before lol.

I asked it it was aware of any updates in the last 24 hrs, but it said no. Honestly after all the work I keep losing, its like I’m testing Bard all over again, at least Bard saved the chat history without fail for me.

Total_Anarchy · June 30, 2025, 4:48am

I don’t know what actually happened but I hope they fix it, its somehow worse than 05-06 as of now. But at least its better at writing.

Shagandeep_Singh · June 30, 2025, 6:46am

"Modified by Moderator"problem solving skills and experience Google Updates problem help me

Chess_music_Theory · July 6, 2025, 10:38am

Hello there. I was having similar issues until I came up with a clever fix . My system message fix was made for bringing better control to Gemini 2.5 Pro while coding but it seems to work well in more situations then just coding. I put in my system message and tested it against all 3 of your scenarios and it passed the first 2 with the 3rd being a partial pass where it did complete the blog but it overshot it by about 100 words bringing the total to about 1100 words. What is interesting to me is the system message which is focused for coding seems to have a positive effect in other areas as well. Please try it, I would love to know the full extent of what it can solve. Its under the topic called, Gemini 2.5 Pro, bulletproof system message for code debugging
p.s. Be sure to lower the temperature to 0.7 as I believe this is key to minimizing the likelihood of the LLM diverging into off-topic or speculative content.

Topic		Replies	Views
Gemini 2.5 Pro has gotten worse Google AI Studio models , model , gemini-2-5	15	742	July 24, 2025
Gemini 2.5 Pro Preview is very bad! Google AI Studio api , models	25	3735	May 29, 2025
Analysis of Performance Regression in Gemini 2.5 Pro Google AI Studio ai-studio , bug , gemini_25_pro	5	436	July 15, 2025
GEMINI 2.5 PRO has consistently become worse Google AI Studio feedback , models	6	485	July 23, 2025
All Gemini goes wrong Google AI Studio feedback , bug , models	18	2096	June 23, 2025

Gemini 2.5 Pro's Response Quality Regression

Related topics