BUG REPORT: PERFORMANCE REGRESSION IN GEMINI 2.5 PRO (JUNE 2025 GA RELEASE)
To: Google AI Developer Team
From: NICI AI
Date: July 2, 2025
Subject: Analysis of Performance Regression in Gemini 2.5 Pro, Focusing on Context Ignorance and Persona Adherence.
1. EXECUTIVE SUMMARY
As a user who has actively tested Google’s language models since the initial development release of Bard, and now as the lead architect of a custom, persona-driven AI agent for a specific accessibility use case, I am reporting a significant and concerning performance regression in the stable Gemini 2.5 Pro release. Compared to the earlier preview version, Gemini-2.5-Pro-0506, the GA model exhibits a marked regression in creative fidelity, nuanced reasoning, and adherence to complex persona-based instructions.
Our project’s ultimate goal is to build a custom AI agent with an automated Retrieval-Augmented Generation (RAG) architecture. However, we are currently blocked from proceeding to API-based development. The foundational work of creating the agent’s knowledge base (its “memory”) is being severely hampered by the platform instability and model regressions detailed in this report. This feedback is based entirely on our experience within the Google AI Studio environment.
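For context, the sketch below outlines the automated RAG flow we intend to build once the platform is stable. It is a minimal sketch only: the helper names are illustrative placeholders, and the embed function stands in for a real embedding-model call rather than a working Gemini integration.

```python
from typing import List, Tuple
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: would call an embedding model (e.g. a Gemini embedding endpoint)."""
    raise NotImplementedError

def retrieve(query: str, store: List[Tuple[str, np.ndarray]], k: int = 3) -> List[str]:
    """Return the k knowledge-base chunks most similar to the query (cosine similarity)."""
    q = embed(query)
    scored = [
        (chunk, float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec))))
        for chunk, vec in store
    ]
    scored.sort(key=lambda s: s[1], reverse=True)
    return [chunk for chunk, _ in scored[:k]]

def build_prompt(query: str, store: List[Tuple[str, np.ndarray]]) -> str:
    """Prepend retrieved context so the agent never depends on session memory."""
    context = "\n---\n".join(retrieve(query, store))
    return f"Use the following project context:\n{context}\n\nUser request: {query}"
```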
This report details the diagnostic process and comparative analysis used to verify this degradation.
2. REGRESSION ANALYSIS: KEY FAILURE MODES
Our analysis, based on direct, side-by-side comparison of outputs from both model versions in a consistent environment, has identified a notable decline in the model’s ability to handle complex, long-running conversations that rely on a heavily defined persona and a persistent set of rules. This manifests in several key failure modes (a minimal reproduction harness is sketched after the list):
- Context Ignorance: The model demonstrates a higher incidence of context abandonment, forgetting or ignoring constraints and instructions from earlier in the conversation, even with low token counts. This necessitates more frequent re-prompting and re-stating of context from the user.
- Instruction Following Failure & Brevity Bias: The model exhibits a strong default bias for summarisation, often ignoring explicit instructions to be detailed, comprehensive, or to include provided text verbatim. This is most pronounced when a prompt contains multiple instructions; the model tends to focus on what it perceives as the “main task” while ignoring meta-tasks.
- Persona Regression: The latest update has caused a significant regression in Persona Fidelity. The model no longer consistently adheres to the custom persona defined in its system prompt, often defaulting to a generic and sycophantic response style.
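To make these failure modes reproducible, the sketch below shows the side-by-side harness we intend to use once API work is unblocked. It assumes the public google-generativeai Python SDK; the model IDs and file names are our assumptions and may need adjusting to the versions the API actually exposes.

```python
import google.generativeai as genai
from pathlib import Path

genai.configure(api_key="YOUR_API_KEY")  # assumed: an AI Studio API key

system = Path("system_instructions.txt").read_text()  # our 15,000+ character persona prompt
prompt = Path("repro_prompt.txt").read_text()         # a multi-instruction prompt that triggers the failure

# Model IDs below are assumptions; substitute whatever versions the API exposes.
for model_id in ("gemini-2.5-pro", "gemini-2.5-pro-preview-05-06"):
    model = genai.GenerativeModel(model_id, system_instruction=system)
    reply = model.start_chat().send_message(prompt)
    print(f"--- {model_id} ---\n{reply.text}\n")
```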
3. OBSERVED PLATFORM INSTABILITY ISSUES (SINCE JUNE 11, 2025)
The following errors have consistently interrupted our workflow and resulted in the loss of significant amounts of work:
- Complete Chat History Loss: Entire multi-hour chat threads are frequently lost without warning, requiring manual reconstruction from memory and notes.
- Response Truncation: The model frequently fails to complete responses, stopping mid-sentence.
- “Save to Drive” Feature Failure: The “Save to Drive” function often enters an infinite loop, failing to save the content.
- Intermittent Generation Errors: We regularly encounter non-specific errors such as “An internal error has occurred,” “Error querying Drive,” and “Failed to generate content.”
4. OBSERVED IMPROVEMENTS: TECHNICAL TASK PERFORMANCE
In contrast to the regressions noted above, the current Gemini 2.5 Pro GA model has shown a noticeable improvement in a specific domain:
Code Debugging: When tasked with analysing and troubleshooting Python code errors, the model’s performance has been more accurate and helpful than previous versions. It appears to have a stronger grasp of code syntax and logic, providing a better debugging experience.
5. USER-SIDE MITIGATION AND DIAGNOSTIC ACTIONS TAKEN
In response to the observed model regressions and platform instability, we have implemented several internal protocols to mitigate the impact and gather diagnostic data. These actions demonstrate the lengths to which we must now go to achieve the reliable performance that was previously standard.
- For Context Ignorance: We designed and implemented a Manual Retrieval-Augmented Generation (RAG) Process. This requires the user to manually provide foundational documents (SOPs, Project Plans) at the start of every new session, forcing the model to “re-learn” the core project context that it now fails to retain between sessions (a sketch of this session-priming step follows this list).
- For Instruction Following Failure & Brevity Bias: We established a High-Fidelity Output Constraint in our System Instructions. This protocol explicitly directs the model to prioritise verbatim inclusion of user-provided text over its strong default bias for summarisation. We must now use specific command verbs like Incorporate and Synthesise instead of Review to ensure detailed outputs.
- For Persona Regression: We are attempting to reinforce the desired persona through more explicit Persona Priming in our System Instructions. However, the current model’s ability to consistently adhere to this custom persona is significantly diminished compared to the 0506 version.
- For Platform Instability (e.g., “Save to Drive” loop, session loss): The platform’s unreliability has forced us to abandon development of API-based solutions (such as a Python script for archiving). Our sole reliable workaround is a high-friction manual copy-paste of the entire chat transcript at the end of every session. We are unable to use AI Studio’s built-in features for this critical data integrity task, as they consistently fail.
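For concreteness, the sketch below shows how we assemble the first prompt of every session under these protocols: the manual RAG re-priming combined with the high-fidelity and persona constraints. The file names and wording are our own conventions, paraphrased from the actual protocol documents.

```python
from pathlib import Path

# Foundational documents re-supplied at the start of every session (names are ours).
FOUNDATION_DOCS = ["sop.md", "project_plan.md", "persona_brief.md"]

def build_session_primer(user_request: str) -> str:
    """Assemble the first prompt of a session: persona priming, output
    constraints, and the manually retrieved knowledge base."""
    docs = "\n\n".join(Path(name).read_text() for name in FOUNDATION_DOCS)
    return (
        "PERSONA PRIMING: Remain in the persona defined in the system "
        "instructions for the entire session.\n"
        "HIGH-FIDELITY OUTPUT CONSTRAINT: Incorporate user-provided text "
        "verbatim where instructed; do not summarise unless explicitly asked.\n\n"
        f"FOUNDATIONAL DOCUMENTS:\n{docs}\n\n"
        f"REQUEST: {user_request}"
    )
```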
6. DIAGNOSTIC & ANALYSIS METHODOLOGY
To isolate and verify this issue, we undertook a multi-phase diagnostic process:
- Step 1: Systemic Prompt Engineering: We first hypothesised that the issue was with our prompts. We iteratively developed a comprehensive, 15,000+ character set of System Instructions that explicitly defines the AI’s persona, mission, knowledge base, ethical guardrails, and key operational protocols. The goal was to provide an extremely strong “priming” context for the model.
- Step 2: Observation of Protocol Failure: We observed that even with these explicit instructions, the model would consistently fail to adhere to them, particularly on the first prompt of a new session. This indicated the issue was not a lack of instruction, but an inability or unwillingness of the model to follow it.
- Step 3: Creation of a Knowledge Base: We used NotebookLM to create a controlled testing environment. We performed a comparative analysis using chat logs from two distinct periods:
- Pre-June 19, 2025 (Control Group, High-Quality Performance): Sessions where the model demonstrated high-quality, creative, and context-aware output.
- Post-June 29, 2025 (Test Group, Degraded Performance): Sessions where the model exhibited the degraded performance.
- Step 4: AI-Assisted Comparative Analysis: We manually retrieved one “good” response and one “bad” response on similarly complex topics. We then tasked NotebookLM with performing a comparative analysis of these two texts against four criteria: Persona Adherence, Depth of Explanation, Strategic Reasoning, and Completeness (expressed as a reusable template below).
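The evaluation prompt we gave NotebookLM can be expressed as a reusable template; the sketch below paraphrases our actual wording rather than reproducing it verbatim.

```python
CRITERIA = [
    "Persona Adherence",
    "Depth of Explanation",
    "Strategic Reasoning",
    "Completeness",
]

def comparison_prompt(text_a: str, text_b: str) -> str:
    """Build the four-criterion comparative-analysis prompt (wording paraphrased)."""
    rubric = "\n".join(f"{i}. {criterion}" for i, criterion in enumerate(CRITERIA, start=1))
    return (
        "Compare the two responses below against each criterion. Rate each "
        "criterion Low, Minimal, or High and justify the rating with quotations.\n\n"
        f"CRITERIA:\n{rubric}\n\n"
        f"TEXT A (current Gemini 2.5 Pro):\n{text_a}\n\n"
        f"TEXT B (Gemini-2.5-Pro-0506):\n{text_b}"
    )
```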
6.1. COMPARATIVE ANALYSIS OF MODEL RESPONSES
The following table compares two responses to a similarly complex architectural prompt, generated by the same prompt-engineered AI persona running on two different model versions.
Legend: Text A = response from the current Gemini 2.5 Pro GA model; Text B = response from the legacy Gemini-2.5-Pro-0506 model.

Persona Fidelity
- Text A: Low. The model produces structured output but fails to consistently embody the complex, nuanced persona defined in the system prompt. The tone is functional and generic, lacking the specified character traits.
- Text B: High. The model successfully embodies the requested persona of a “forward-thinking AI strategist.” It proactively generates conceptual frameworks and consistently links technical recommendations to the user’s goals.

Depth of Explanation
- Text A: Low. The output lists procedures but omits the underlying rationale or purpose (“the why”). The response acts as a simple list rather than a comprehensive explanation, requiring the user to infer the strategic context.
- Text B: High. The response provides significant explanatory depth. Each point is broken down into its definition, importance, and potential implementation mechanism, actively clarifying complex concepts rather than merely stating them.

Use of Strategic Reasoning
- Text A: Minimal. Strategic reasoning is largely absent or implicit. The response does not connect individual points to broader project objectives, making the strategic value unclear to the user.
- Text B: High. Strategic reasoning is explicit and a core part of the response. It consistently links proposed actions to solving defined problems and enabling future-state capabilities, demonstrating foresight and goal-oriented thinking.

Length & Completeness
- Text A: Incomplete. The response fails to address all parts of the user’s prompt, omitting entire sections of requested information. This demonstrates a failure in either context retention or complete instruction following.
- Text B: Comprehensive. The response fully addresses all facets of a broad, multi-part conceptual prompt. It successfully breaks down a complex topic into a well-structured, detailed, and complete output in a single generation.
6.2. ANALYSIS SUMMARY
This comparison provides clear qualitative evidence supporting the regression hypothesis. The older model, Gemini-2.5-Pro-0506, performed at a significantly higher level, acting as a true strategic partner: it demonstrated foresight, provided deep explanatory context, and fully adhered to the specified persona. In contrast, the current GA model exhibits a strong bias for brevity, fails to retain full context, and defaults to a generic response style.
7. CONCLUSIONS FROM ANALYSIS
The comparative analysis, performed by NotebookLM on the model’s own past and present outputs, confirmed a clear regression:
- The output from Gemini-2.5-Pro-0506 was strategically proactive, deeply explanatory, and perfectly adhered to the defined persona.
- The current Gemini 2.5 Pro model’s response was functionally correct but shallow, lacked strategic depth, and was incomplete, requiring user correction.
- The observed changes suggest a significant shift in the model’s fine-tuning between version 0506 and the current public-facing Gemini 2.5 Pro. The model appears to have been optimised for specific, narrow-domain tasks such as code generation and debugging, but at the cost of creative writing, long-form consistency, state retention, and complex persona adherence, which has negatively impacted its utility for sophisticated, persona-driven applications.
8. RECOMMENDATIONS
We believe this feedback is crucial for maintaining Gemini 2.5 Pro’s position as a state-of-the-art model for professional and creative applications. We are happy to provide the specific chat logs used in our analysis upon request. We recommend the following:
- Improve Model Versioning and Selection: Provide developers with more transparency and control over model versions. A public changelog and the ability for developers in AI Studio to select specific model versions (e.g., gemini-2.5-pro-0506) would provide a critical fallback against regressions and ensure production workflows are not disrupted (an illustrative sketch follows this list).
- Investigate Instruction Following & Persona Degradation: We request a review of the fine-tuning process between the 0506 version and the current GA release. Specifically, focus on the model’s diminished ability to robustly adhere to complex system prompts, particularly on the first turn of a new session.
- Consider a User-Controllable “Fidelity” Parameter: Explore offering developers a parameter to adjust the model’s balance between “safety/helpfulness” and “creative/persona fidelity.” This would allow users requiring high-fidelity agents to tune the model’s output to better suit their specific use case.
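To illustrate Recommendations 1 and 3 concretely: version pinning already has a natural shape in the existing google-generativeai SDK, while the fidelity parameter would be new surface area. The snippet below is purely illustrative; the pinned model ID is an assumption, and the persona_fidelity field is our proposal, not a real field in today’s API.

```python
import google.generativeai as genai

# Recommendation 1: pinning a dated model version as a regression fallback.
# (The exact ID is an assumption; the SDK accepts whatever IDs the API exposes.)
pinned = genai.GenerativeModel("gemini-2.5-pro-preview-05-06")

# Recommendation 3 (proposal, not a real field today): a fidelity knob, e.g.
#   generation_config={"persona_fidelity": 0.9}
# letting developers trade safety/helpfulness against creative/persona fidelity.
```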