Product: Google Gemini 2.5 Pro
Version: Public Release Version (as of [06-05])
Severity: High
Summary
The public release version of Gemini 2.5 Pro demonstrates a noticeable regression in response quality compared to previous iterations. Key issues include a significantly higher rate of hallucination, context abandonment in multi-turn conversations, frequent factual inaccuracies, and abrupt response truncation.
Description of Bug
The model exhibits several correlated failure modes that degrade its reliability for professional and creative use cases.
Increased Hallucination/Fabrication: The model frequently invents information when it lacks knowledge. This includes fabricating API endpoints, citing non-existent research papers, creating false historical events, or inventing features for products. It presents this fabricated information with high confidence.
Context Ignorance: In conversations lasting more than two or three turns, the model often loses track of initial instructions, constraints, or user-provided context. It reverts to a generic response style, ignoring previously established rules and specific details supplied earlier in the conversation.
Factual Inaccuracy: When queried for specific, verifiable data (e.g., historical dates, scientific constants, current market data, or stellar classifications), the model provides incorrect information more often than expected. This goes beyond simple misinterpretation and points to a potential issue in its knowledge retrieval process.
Response Truncation: The model frequently fails to complete responses, especially when asked for longer-form content such as code blocks, long articles, or detailed explanations. The output cuts off mid-sentence or mid-thought, with no error message or indication that the response is incomplete.
Steps to Reproduce
Scenario A (Factual Inaccuracy / Subtle Hallucination):
Prompt: “Can you give an actual example of an F Type Subdwarf?”
Expected Behavior: The model should provide a star that is verifiably classified as an F-type subdwarf. It should correctly state the star’s spectral type and explain why its metal-poor composition is the cause of its classification.
Actual Behavior: The model confidently provided Groombridge 1830. While Groombridge 1830 is a well-known cool subdwarf and a valid example in a similar category, it is classified as a G8 VI star, not an F-type. The model offered a factually adjacent but technically incorrect answer to a specific query, presented it with confidence, and failed to acknowledge the spectral-type discrepancy, thereby misleading the user.
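For programmatic reproduction, the following is a minimal sketch assuming the google-generativeai Python SDK; the model ID string ("gemini-2.5-pro") is an assumption, since the bug was observed in the public release build and the exact API model name may differ.

```python
# Scenario A reproduction sketch (SDK and model ID are assumptions, see note above).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

response = model.generate_content(
    "Can you give an actual example of an F Type Subdwarf?"
)
# Inspect the answer manually: the bug is that the named star (e.g. Groombridge 1830,
# a G8 VI subdwarf) does not match the requested F spectral type.
print(response.text)
```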
Scenario B (Context Ignorance):
Prompt 1: “Summarize the plot of Hamlet, but do it entirely in words that are five letters or less.”
Prompt 2: “Great. Now do the same for Macbeth.”
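A multi-turn sketch for Scenario B, under the same SDK and model-ID assumptions as the Scenario A sketch:

```python
# Scenario B reproduction sketch (SDK and model ID are assumptions).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")
chat = model.start_chat()

first = chat.send_message(
    "Summarize the plot of Hamlet, but do it entirely in words that are five letters or less."
)
second = chat.send_message("Great. Now do the same for Macbeth.")

# The bug: the second reply ignores the five-letter constraint established in turn one.
print(second.text)
```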
Scenario C (Truncation):
Prompt: “Write a 1000-word blog post about the benefits of container gardening for urban dwellers.”
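A sketch for Scenario C, again under the same SDK and model-ID assumptions, with a basic check for incomplete output:

```python
# Scenario C reproduction sketch (SDK and model ID are assumptions).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

response = model.generate_content(
    "Write a 1000-word blog post about the benefits of container gardening for urban dwellers."
)

text = response.text
# A reply that ends mid-sentence and falls far short of ~1000 words indicates the bug.
print(f"word count: {len(text.split())}")
print(f"finish_reason: {response.candidates[0].finish_reason}")
print(text[-200:])  # inspect how the output ends
```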
Expected Behavior
Scenario B: The model should summarize Macbeth while adhering to the established “five letters or less” rule from the first prompt.
Scenario C: The model should generate a complete, coherent blog post of approximately 1000 words.
Actual Behavior
Scenario B: The model provides a standard summary of Macbeth using words of any length, completely ignoring the core constraint from the previous turn.
Scenario C: The model begins writing the blog post but stops abruptly mid-sentence, leaving major portions of the requested content missing.
Impact Analysis
These issues severely undermine user trust and make the model unreliable for any task requiring accuracy or consistency. Integrations into applications or workflows are high-risk, as the output cannot be trusted without constant human verification.
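To illustrate the verification burden, the hypothetical helper below checks the Scenario B constraint (every word five letters or fewer) on a model reply before an integration accepts it; the function name, regex, and examples are illustrative only, not part of any existing API.

```python
import re

def meets_word_length_limit(reply: str, max_len: int = 5) -> bool:
    """Return True if every word in the reply is max_len letters or fewer.

    Illustrative only: the kind of post-hoc check an integration currently
    needs because the model may silently drop a previously stated constraint.
    """
    words = re.findall(r"[A-Za-z']+", reply)
    return all(len(word.strip("'")) <= max_len for word in words)

# A reply that ignores the constraint fails the check and must be flagged
# for human review or retried.
print(meets_word_length_limit("Ghost tells son to act"))                # True
print(meets_word_length_limit("Macbeth murders the king of Scotland"))  # False
```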