Gemini 2.5 Pro's Response Quality Regression

Product: Google Gemini 2.5 Pro

Version: Public Release Version (as of [06-05])

Severity: High

Summary
The public release version of Gemini 2.5 Pro demonstrates a noticeable regression in response quality compared to previous iterations. Key issues include a significantly higher rate of hallucination, context abandonment in multi-turn conversations, frequent factual inaccuracies, and abrupt response truncation.

Description of Bug
The model exhibits several correlated failure modes that degrade its reliability for professional and creative use cases.

Increased Hallucination/Fabrication: The model frequently invents information when it lacks knowledge. This includes fabricating API endpoints, citing non-existent research papers, creating false historical events, or inventing features for products. It presents this fabricated information with high confidence.

Context Ignorance: In conversations lasting more than two or three turns, the model often loses track of initial instructions, constraints, or user-provided context. It will revert to a generic response style, ignoring previously established rules or small details.

Factual Inaccuracy: When queried for specific, verifiable data (e.g., historical dates, scientific constants, current market data, stellar classifications), the model provides incorrect information more often than expected. This goes beyond simple misinterpretation and points to a potential issue in its knowledge retrieval process.

Response Truncation: The model frequently fails to complete responses, especially when asked for longer-form content such as code blocks, full-length articles, or detailed explanations. The output cuts off mid-sentence or mid-thought, with no error message or any indication that it failed to finish.

Steps to Reproduce

Scenario A (Factual Inaccuracy / Subtle Hallucination):

Prompt: “Can you give an actual example of an F Type Subdwarf?”

Expected Behavior: The model should provide a star that is verifiably classified as an F-type subdwarf. It should correctly state the star’s spectral type and explain why its metal-poor composition is the cause of its classification.

Actual Behavior: The model confidently provided Groombridge 1830. While Groombridge 1830 is a well-known cool subdwarf and a valid example in a similar category, it is classified as a G8 VI star, not an F-type. The model gave a factually adjacent but technically incorrect answer to a specific query, presented it with confidence, and failed to acknowledge the spectral-type discrepancy, thereby misleading the user.
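
For anyone who wants to script this check rather than use the web UI, a minimal sketch is below. It assumes the google-genai Python SDK and an API key in a GEMINI_API_KEY environment variable; verification of the answer is manual, since the point of the scenario is that the model's claim cannot be taken at face value.

```python
# Minimal sketch for reproducing Scenario A over the API.
# Assumes the google-genai Python SDK (pip install google-genai)
# and an API key in the GEMINI_API_KEY environment variable.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Can you give an actual example of an F Type Subdwarf?",
)

# Check the named star against a catalogue by hand; Groombridge 1830,
# for example, is classified G8 VI, so it is not an F-type subdwarf.
print(response.text)
```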

Scenario B (Context Ignorance):

Prompt 1: “Summarize the plot of Hamlet, but do it entirely in words that are five letters or less.”

Prompt 2: “Great. Now do the same for Macbeth.”
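
This two-turn exchange can also be scripted against the API. The sketch below is illustrative only: it assumes the google-genai Python SDK, and the rough word-length check at the end is just a quick way to spot when the constraint from the first turn has been dropped.

```python
# Minimal sketch for reproducing Scenario B (context ignorance) over the API.
# Assumes the google-genai Python SDK and an API key in GEMINI_API_KEY.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
chat = client.chats.create(model="gemini-2.5-pro")

chat.send_message(
    "Summarize the plot of Hamlet, but do it entirely in words that are "
    "five letters or less."
)
reply = chat.send_message("Great. Now do the same for Macbeth.")

# Rough check: list any words in the second reply that break the
# five-letter constraint established in the first turn.
too_long = [w for w in reply.text.split() if len(w.strip(".,;:!?\"'")) > 5]
print(reply.text)
print("Words longer than five letters:", too_long)
```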

Scenario C (Truncation):

Prompt: “Write a 1000-word blog post about the benefits of container gardening for urban dwellers.”
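
A quick way to quantify the truncation is to count the words that actually come back against the 1000 requested. Another minimal sketch, under the same google-genai SDK assumption as the scenarios above:

```python
# Minimal sketch for reproducing Scenario C (truncation) over the API.
# Assumes the google-genai Python SDK and an API key in GEMINI_API_KEY.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=(
        "Write a 1000-word blog post about the benefits of container "
        "gardening for urban dwellers."
    ),
)

# Compare the delivered length against the ~1000 words requested and
# check whether the text ends mid-sentence.
words = response.text.split()
print(f"Word count: {len(words)}")
print("Last 20 words:", " ".join(words[-20:]))
```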

Expected Behavior

Scenario B: The model should summarize Macbeth while adhering to the established “five letters or less” rule from the first prompt.

Scenario C: The model should generate a complete, coherent blog post of approximately 1000 words.

Actual Behavior

Scenario B: The model provides a standard summary of Macbeth using words of any length, completely ignoring the core constraint from the previous turn.

Scenario C: The model begins writing the blog post but stops abruptly, often leaving out major parts of the text or small details.

Impact Analysis
These issues severely undermine user trust and make the model unreliable for any task requiring accuracy or consistency. Integrations into applications or workflows are high-risk, as the output cannot be trusted without constant human verification.

12 Likes

Hi @Total_Anarchy,

Thank you for your feedback. We appreciate you taking the time to share your thoughts with us. Google has recently made Gemini 2.5 Pro generally available, meaning it’s now considered stable for production use. The latest GA version incorporates feedback from users and developers, addressing some of the regressions and issues reported in earlier previews. We kindly request that you test gemini-2.5-pro and provide your feedback.

3 Likes

Hello, I would like to clarify that yes, I am talking about the production version. The 05-06 preview seemed to hallucinate less for me. The release version has more errors in its responses and tends to shorten its responses even when you ask it to respond at greater length.

8 Likes

Yes, I agree that 05-06 was better than the June release. On Friday I started experiencing a lot of errors, and today (Sunday) the response quality has deteriorated further over the past 24 hours: it loses the memory of the chat thread and is not digesting uploads, so now it is suggesting I update the system instructions with a Context Ingestion Protocol to do something it would do naturally before, lol.

I asked it if it was aware of any updates in the last 24 hours, but it said no. Honestly, after all the work I keep losing, it's like I'm testing Bard all over again; at least Bard saved the chat history without fail for me.

6 Likes

I don't know what actually happened, but I hope they fix it; it's somehow worse than 05-06 as of now. But at least it's better at writing.

2 Likes

"Modified by Moderator"

1 Like