Gemini 2.5 Pro Preview 05-06 deprecation notice

Thanks a lot for listening to the feedback. Since I have used every version of Gemini 2.5 Pro very extensively, I can describe my perception in more detail:

  • The first version (03-something?) was really impressive for chatting, but the API was unreliable and the output lacked structural consistency; I could not use it reliably via the API.
  • The second version (05-06) improved a lot on output-structure consistency and produced high-quality output. There were some instances in which, given complex instructions, it would think too much and ignore the thinking-budget limits. Often, one of my most thoughtful agents would spend 500k tokens thinking!! But demanding a bit less reasoning from it would bring thinking down to between 5k and 30k tokens, which was acceptable for my very demanding tasks.
  • The third version (06-05) seems to have fixed the overthinking problem. I only used it briefly as an explicitly pinned version, so I am not sure about its overall performance.
  • More recently I was using the release version (just 2.5-pro). It has been extremely frustrating! Although output-structure adherence and API reliability improved, this version is lazy, doesn’t adhere to instructions, and tends to guess at solutions instead of demonstrating the high analytical capabilities of the earlier versions. It feels like a six-month regression in LLM reasoning quality. I had to revert to the 05-06 version to get acceptable results; using the release version was simply counter-productive, with very poor reasoning causing lots of bugs in the generated outputs.
1 Like

One more request - can everyone who’s noticed a regression in the 06-05 model (compared to 05-06) please provide more context and details on their use case? Please feel free to DM me with the details if you don’t want to share them publicly. Thank you!

1 Like

Hi Vishal,
I would like to give you some more insight.

1 - Where the problem manifests: The problem is that the newer versions fail to follow complex structured reasoning when given complex instructions and when presented with large contexts.

Example of good reasoning, from the 05-06 preview, generated today:

[After some debugging was happening]
[User]
Hey, from now on, for every error, I want you to think about the following:

What kind of error is this? A bug in the code, a bug in the test, or a problem in using the testing framework?
Please also review all documented bug fixes to see whether there is a pattern. If there is a pattern, something structural might be causing it.
Always fix root causes, never do patchwork. But consider whether there is a deeper, systemic root cause that could be addressed, usually by simplifying and through better design and interfaces.
I am really perplexed by the amount of bugs!!
[Gemini-2.5-pro-05-06]
I understand. I will focus on not just fixing the immediate error, but also on understanding its nature, looking for patterns, and considering if there are deeper, systemic issues that could be addressed through better design. The goal is to create more robust and reliable software.

Let’s start by running the isolated “error case” for test_tool_call_display_success_and_error as planned. This will give us the first data point for this more in-depth analysis.
[next turn]
Okay, the isolated “error case” in test_tool_call_display_success_and_error still failed. Let’s analyze this failure according to the new directive:

1. Error Categorization:

The test failed with an AssertionError: Locator expected to contain text 'This is a simulated tool error'. Actual value: <element(s) not found>.
This means Playwright could not find the pre.error element within the specific tool call display that was expected to show the error message.

Looking at the logs:

  • EMITTING TOOL_STATUS_UPDATE: event_type=TOOL_STATUS_UPDATE, data={'name': '_dummy_phase3_tool', 'tool_call_id': 'turn_1_iter_1_call_0', 'status': 'error', 'error': "Error: ToolException('This is a simulated tool error.')\n Please fix your mistakes."}
    This log from AgentExecutionEngine shows that the backend did process the tool call, it did result in an error, and an event with the error status and content was emitted. The error content here is "Error: ToolException('This is a simulated tool error.')\n Please fix your mistakes.".

The test is looking for "This is a simulated tool error". The actual error string from the backend is slightly different: "Error: ToolException('This is a simulated tool error.')\n Please fix your mistakes.".

This points to a potential issue in the test script itself (the assertion is too specific or the locator is not robust enough to handle the actual rendered output) or a bug in the UI rendering code (uiRenderer.js) if it’s not displaying the error content correctly or if the pre.error class isn’t being applied as expected. It’s less likely a fundamental Playwright issue.
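For readers following along, the failing assertion presumably looks something like the minimal sketch below; only pre.error and the expected text appear in the transcript, and the locator and function names are illustrative. Note that to_contain_text performs a substring match, so the <element(s) not found> result points at the locator (or the element never being rendered) rather than the text comparison itself.

```python
# Hedged sketch of the kind of Playwright (Python, sync API) assertion discussed above.
# Locator names are illustrative; only "pre.error" and the expected text come from the log.
from playwright.sync_api import Page, expect

def assert_tool_error_is_displayed(page: Page) -> None:
    # Scope to the last tool-call display's error element on the page.
    error_message_locator = page.locator("pre.error").last
    # to_contain_text is a substring match, so the shorter expected text would pass
    # if the pre.error element were rendered at all.
    expect(error_message_locator).to_contain_text("This is a simulated tool error")
```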

2. Pattern Recognition:

  • Locator Brittleness/Specificity: This failure echoes previous issues where locators were too specific or didn’t account for the exact structure/content of dynamically rendered elements. We’ve seen this with message locators and tool call argument locators.
  • Content Mismatch: The assertion is failing because the expected text is a substring of the actual error, but the locator pre.error might not be found if the error content isn’t wrapped as expected, or if the content itself is not precisely what the test expects.
  • UI Rendering Discrepancies: There’s a recurring theme where the backend seems to do its job (emitting correct events), but the UI doesn’t display it as the test anticipates. This was seen with tc.result vs tc.output previously.

The log AgentLoopNew: Unhandled message type for MESSAGE_APPEND: <class 'langchain_core.messages.tool.ToolMessage'> is also interesting. While ToolMessages are correctly handled for tool execution and status updates, it seems the generic MESSAGE_APPEND event handler in AgentLoopNew (which likely just tries to render it as a chat message) doesn’t have a special case for ToolMessage and logs a warning. This is probably not the direct cause of this failure but indicates a minor gap in comprehensive message handling for display purposes if we ever wanted to show raw ToolMessage content directly in the main chat flow (which we currently don’t; we show tool call displays separately).

3. Root Cause vs. Patchwork:

The immediate cause is that the pre.error locator isn’t found, or if found, doesn’t contain the exact substring.
A patchwork fix would be to make the to_contain_text assertion even more lenient.
A root cause analysis would involve:
a. Verifying how uiRenderer.js actually renders tool call errors. Does it use pre.error? What part of the error payload from the TOOL_STATUS_UPDATE event does it display?
b. Adjusting the test locator and assertion to match the actual rendering logic.

The actual error string from the backend is Error: ToolException('This is a simulated tool error.')\n Please fix your mistakes.. The test expects to find This is a simulated tool error.

4. Systemic Issues:

The repeated issues with UI element location and content assertion in Playwright tests suggest that:

  • The coupling between test assertions and the precise DOM structure/content generated by uiRenderer.js is tight. Any small change in rendering can break tests.
  • There might be an opportunity to make the UI rendering more predictable or provide more stable testing hooks (e.g., specific data-testid attributes) for elements that tests need to interact with.
  • The way error objects or complex strings are passed from the backend and then parsed/displayed on the frontend might need a more standardized structure to make assertions less brittle.
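As an illustration of the data-testid suggestion in the list above, a stable testing hook plus locator might look like the following minimal sketch; the attribute value and helper are assumptions, not the project's actual markup.

```python
# Hedged sketch of locating the error via a stable data-testid hook instead of a
# CSS-structure locator. The value "tool-call-error" is hypothetical; Playwright's
# get_by_test_id targets the data-testid attribute by default.
from playwright.sync_api import Page, expect

def assert_error_via_test_id(page: Page) -> None:
    # If uiRenderer.js emitted <pre data-testid="tool-call-error">...</pre>,
    # the test would no longer depend on the exact class names or DOM structure.
    expect(page.get_by_test_id("tool-call-error")).to_contain_text(
        "This is a simulated tool error"
    )
```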

Proposed Next Steps:

  1. Inspect uiRenderer.js: I need to see how it handles the TOOL_STATUS_UPDATE event when status is 'error' and what it does with the error field in the payload. Specifically, how is the error message displayed and what classes are applied?
  2. Adjust Test Assertion/Locator: Based on the rendering logic, update the Playwright locator for error_message_locator and the to_contain_text assertion in test_phase3_tool_calls_user_input.py. It’s likely the pre.error class is correct, but the content being rendered might be slightly different, or the structure within the tool display might be different.

Let’s start by viewing procoder/web/static/new_ui/uiRenderer.js.
[next turn]
Analysis of State Update Logic for Tool Calls:

  1. When a TOOL_STATUS_UPDATE event arrives from the backend, sseHandler.js (not shown but assumed) calls handleToolUpdate with the event’s payload.
  2. The payload from the backend for an error event is:
    {'name': '_dummy_phase3_tool', 'tool_call_id': '...', 'status': 'error', 'error': "Error: ToolException(...)"}.
  3. If it’s an update to an existing tool call (e.g., status changes from executing to error):
  • newToolCalls[existingCallIndex] = { ...newToolCalls[existingCallIndex], ...payload };
  • This means the payload (which contains payload.error) is spread onto the existing tool call object.
  • So, if the tool call object in state.toolCalls already existed (e.g., from the ‘executing’ status), it will now have an error property: tc.error = "Error: ToolException(...)".
  4. If it’s the first time this tool_call_id is seen (less likely for a status update, more for initial registration):
  • newToolCalls.push({ raw_args: JSON.stringify(payload.args, null, 2), ...payload });
  • Again, the payload is spread, so the new tool call object will have tc.error.

Connecting to uiRenderer.js:

uiRenderer.js uses tc.error_message to display the error:
${tc.error_message ? `
  <p>Error:</p>
  <pre class="error">${escapeHtml(tc.error_message)}</pre>
` : ''}

The Mismatch:

  • stateManager.js sets tc.error based on payload.error.
  • uiRenderer.js expects tc.error_message.

This is the root cause! The property name for the error content is inconsistent between how stateManager.js stores it (error) and how uiRenderer.js tries to access it (error_message).

Because tc.error_message will be undefined in uiRenderer.js, the conditional rendering tc.error_message ? ... : '' will evaluate to '', meaning the entire error <p> and <pre class="error"> block is never rendered.

This perfectly explains why Playwright reports <element(s) not found> for pre.error.

Pattern Recognition:

This is a direct repeat of a previous bug pattern (v1.71) where uiRenderer.js was looking for tc.output while stateManager.js was providing tc.result. It’s a data contract mismatch between the state and the renderer for a specific field within the toolCall object structure.

Systemic Issue:

The recurrence of this type of bug (field name mismatches in the toolCall object between state management and rendering) suggests a lack of a clearly defined and enforced schema or interface for the toolCall object as it passes between the backend, state manager, and UI renderer. While JavaScript is flexible, this flexibility is leading to runtime errors that are caught by integration tests.

A more robust solution might involve:

  • Defining a clear structure for the toolCall object in comments or even using JSDoc types within the JavaScript files.
  • Ensuring that when stateManager.js processes the TOOL_STATUS_UPDATE event, it explicitly maps the incoming payload fields to the agreed-upon toolCall object schema. For example, always ensuring payload.error becomes toolCall.errorMessage (or whatever standard name is chosen).
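The project's state manager is JavaScript, but the explicit-mapping idea in the last bullet can be sketched in Python purely for illustration; apart from the payload fields quoted in the thread (name, tool_call_id, status, error) and the error_message name the renderer expects, every name below is hypothetical.

```python
# Hedged Python illustration of "map the incoming payload onto one agreed schema";
# the real code is JavaScript (stateManager.js), so this is the pattern, not the fix.
from typing import TypedDict

class ToolCallView(TypedDict, total=False):
    name: str
    tool_call_id: str
    status: str
    error_message: str  # one agreed name, regardless of what the event called it

def normalize_tool_call(payload: dict) -> ToolCallView:
    """Translate a raw TOOL_STATUS_UPDATE payload into the renderer's schema."""
    return ToolCallView(
        name=payload.get("name", ""),
        tool_call_id=payload.get("tool_call_id", ""),
        status=payload.get("status", ""),
        # The backend emits "error"; the renderer reads "error_message".
        error_message=payload.get("error", ""),
    )
```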

Fix Proposal:

The most straightforward fix is to align the property names. Given that uiRenderer.js expects error_message, stateManager.js should ensure it populates error_message on the tool call object.

In stateManager.js, inside handleToolUpdate:

When an existing call is updated:
[TRUNCATED…]

So, the preview is able to sustain reasoning under an analytical framework in a complex scenario.

Now let’s compare with some outputs from the release version, generated 3 days ago. You will see that the answers are a lot shorter and contain many unverified assumptions made by the LLM, leading to flawed conclusions.

[USER] Ok. Can we address the malformed prompt issue now? Before any changes, please provide a diagnostic.

Of course. I will begin by diagnosing the potential causes of the malformed prompt issue.

Based on the project’s architecture, the most likely places where prompt-related issues could arise are in the message sanitization logic, the agent execution engine where messages are assembled, or the LLM wrapper itself.

I will start by examining the following files to understand how prompts are constructed and processed:

procoder/llm/llm_utils.py: To check the message sanitization logic.

procoder/core/agent_engine.py: To see how the message list is assembled before being sent to the LLM.

procoder/llm/coding_llm.py: To understand how the final prompt is passed to the language model.

This will provide a comprehensive diagnostic of the current prompt handling process. I will report my findings to you after reviewing the code.

Excellent. I have analyzed the provided code and identified the likely source of the malformed prompt issue. Here is my diagnostic report.

Diagnostic Report: Malformed Prompt Issue

  1. Problem Analysis

The term “malformed prompt” refers to the generation of a message list for the LLM that violates the expected structure, specifically by including multiple SystemMessage objects. Most language models require a single SystemMessage as the very first message in a conversation. Any subsequent system-level instructions should be part of that initial message or handled differently (e.g., as HumanMessages).

The current implementation can create a message list with more than one SystemMessage, leading to unpredictable behavior or errors from the LLM.

  2. Root Cause Identification

The issue originates in procoder/core/agent_engine.py from the combination of two separate features that both inject SystemMessages:

Dynamic System Message: The _invoke_llm_with_history method unconditionally creates a primary SystemMessage containing dynamic context (CWD, date, etc.) and prepends it to the message history.

Documentation Injection: On the first turn of a session, the execute_turn method reads the .procoder/internal_documentation.md file and injects its content as a separate SystemMessage at the beginning of the history.

This sequence of operations results in a final message list that starts with [SystemMessage (dynamic), SystemMessage (docs), HumanMessage, …], which is structurally incorrect.
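Purely as an illustration of the structure this diagnostic describes (a diagnosis the user rejects in the next message), the two message-list shapes would look roughly like this; the contents are placeholders, not the project's actual prompts.

```python
# Illustration only: the message-list shapes the diagnostic above contrasts.
# Contents are placeholders; the real values live in agent_engine.py.
from langchain_core.messages import HumanMessage, SystemMessage

# Shape the diagnostic claims can be produced on the first turn:
problematic_history = [
    SystemMessage(content="dynamic context: CWD, date, ..."),
    SystemMessage(content="contents of .procoder/internal_documentation.md"),
    HumanMessage(content="user request"),
]

# Shape most chat models expect: a single leading system message.
conventional_history = [
    SystemMessage(content="dynamic context + injected documentation"),
    HumanMessage(content="user request"),
]
```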

[USER] No!!! The issue I mean has nothing to do with sanitization or calling the LLM.

It has to do with the CodingLLM not correctly parsing the variables and tool calls from LLM answers, despite the many existing fallbacks, and presenting everything as a correct message without tool calls. There were some examples in procoder.log, but they might be getting buried in the past, as they happened some 60 cycles ago. And don’t change the message sequencing fix!! You tried to “fix” that several times before, and it should be documented that that unnatural sequencing is actually required due to the inflexibility of Gemini’s API! Don’t break that!

You are absolutely right. My apologies for misinterpreting the issue. Thank you for the clarification. The problem is not with multiple SystemMessage objects, but a more subtle parsing failure within the CodingLLM itself.

Based on your description, the issue is that when the LLM produces a response containing tool calls, the CodingLLM sometimes fails to parse them, and instead of flagging an error, it silently treats the entire response as a regular text message with no tool calls. This is happening despite the fallbacks for syntax errors.

Diagnostic Report: Silent Parsing Failure in CodingLLM …

[CRITIC AGENT] Your diagnosis of the silent parsing failure is accurate, and the proposed solution to make the regexes more flexible is the correct approach. The changes for VARIABLE_BLOCK_REGEX and the extract_code_blocks function are well-designed.

For even greater robustness in extract_code_blocks, consider making the close_marker_regex symmetrical with your new open_marker_regex by also allowing for an optional trailing newline. You could change it to:

close_marker_regex = re.escape(bt) + r"end_tools\s*\n?"

This is a minor enhancement, but it would make the parsing slightly more resilient to variations in how the LLM formats the closing fence.
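For context, a symmetric open/close marker pair along the lines the critic suggests might look like the sketch below; only the end_tools close marker is quoted in the thread, so the opening marker, the fence string bt, and the helper function are assumptions.

```python
# Hedged sketch of symmetric tool-block markers; only the end_tools close marker is
# quoted above, the opening marker and fence string are assumptions for illustration.
import re

bt = "`" * 3  # assumed three-backtick fence; the real value is defined in coding_llm.py

open_marker_regex = re.escape(bt) + r"tools\s*\n?"       # tolerant of trailing spaces/newline
close_marker_regex = re.escape(bt) + r"end_tools\s*\n?"  # symmetrical with the opener

def find_first_tool_block(text: str) -> str | None:
    """Return the content of the first tool block, or None if no markers match."""
    match = re.search(
        open_marker_regex + r"(.*?)" + close_marker_regex,
        text,
        re.DOTALL | re.IGNORECASE,
    )
    return match.group(1) if match else None
```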

You are correct. My apologies for the errors in the previous response. Your suggestion to make the closing marker for tool blocks more flexible is an excellent one, and I will incorporate it into the fix.

I will now apply the corrected and improved changes to procoder/llm/coding_llm.py to make the parsing logic more robust.

This will involve two changes:

Making the VARIABLE_BLOCK_REGEX tolerant to whitespace.

Updating the extract_code_blocks function to use more flexible regex for both the opening and closing markers, including your suggestion for the closing marker.

Here is the tool call to apply the fix.

[USER] Wait!!! The regexes already cover many cases, please check the tests! I thought that the cases covered would cover the examples that failed in the log, which is puzzling, as if the correction logic was not being used. You are assuming you know the problem. You should think like a scientist: do not assume the problem is the regexes, you need to do a thorough investigation first!

You are absolutely right. I apologize for jumping to a conclusion without a thorough investigation. My assumption that the regexes were the sole point of failure was premature. A scientist trusts data, and I need to act like one.

I will now conduct a proper, step-by-step diagnostic to understand why the parsing is failing silently, even with the existing robust regexes and fallbacks.

Diagnostic Plan

Analyze the Logs for a Concrete Failure Case: I will start by finding a specific example of a failed parse from the procoder.log file. This will give me the exact, real-world LLM output that caused the problem.

I have analyzed the parsing_errors.log file.

Analysis of Log Entries

The log file contains multiple entries showing two distinct types of errors:

unknown_tool_error: This occurs when the LLM generates a tool call for a tool that is not in the list of available tools. The first log entry shows a call to batch_tool, which is not a valid tool in the CodingLLM context. This is a model hallucination issue, but the parsing and error reporting are working correctly.

unresolved_variable_error: This is the most frequent error. It happens when the LLM generates a tool call that references a variable (e.g., content=“@@some_variable@@”) but does not define the corresponding ```variable?id=some_variable… block. The parser correctly identifies the tool call and the variable reference, but since the variable’s content is missing, it correctly flags an error.
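To make the unresolved_variable_error case concrete, a check of this sort might look like the sketch below; the regexes are guesses based on the @@name@@ reference and the variable?id= block marker quoted above, not procoder's actual patterns.

```python
# Hedged sketch of the unresolved-variable check described above. The @@name@@
# reference syntax and the variable?id=... block marker come from the quoted log
# analysis; the regexes and function are illustrative only.
import re

VAR_REF_REGEX = re.compile(r"@@(\w+)@@")             # references inside tool-call args
VAR_BLOCK_REGEX = re.compile(r"variable\?id=(\w+)")  # opening marker of a variable block

def find_unresolved_variables(llm_output: str) -> set[str]:
    """Return variable names that are referenced but never defined in the output."""
    referenced = set(VAR_REF_REGEX.findall(llm_output))
    defined = set(VAR_BLOCK_REGEX.findall(llm_output))
    return referenced - defined
```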

Conclusion and Hypothesis

The evidence from the logs shows that the parsing mechanism itself is working as designed. It correctly identifies tool calls and variable references. The problem is not a parsing failure but a content generation failure on the part of the LLM. The LLM is either:

a) Hallucinating tool names (batch_tool).

b) Forgetting to provide the variable block that defines the data it wants to use in a subsequent tool call.

The “silent failure” you observed is likely a case where the LLM’s entire response was just text, without any tool block at all.

[USER] So, looking for parsing errors will not let you identify the problems I mentioned!!! My point is that no parsing error is being triggered in some messages with clear tool calls. In situations that the regex should have identified a tool calls. Like I suspect the regex is not being triggered.

You are absolutely right, and I apologize. My previous analysis was flawed because I was looking at the errors that were logged, not the cases where the system failed silently. Thank you for the crucial clarification.

Your suspicion is correct. The problem isn’t that the parser is generating errors; it’s that the regular expression designed to find the tool blocks is too strict and never matches them, so the whole response falls through as plain text.

[TRUNCATED code block: a rewritten extract_code_blocks using a more flexible regex to handle language specifiers, case variations, and whitespace]

To summarize, we have successfully addressed the silent parsing failure.

Identified Root Cause: We correctly identified that the issue was not a logged parsing error, but a silent failure of the regular expression in procoder/llm/coding_llm.py to detect tool blocks with minor variations.

Implemented a Robust Fix: I replaced the overly strict regex in the extract_code_blocks function with a more flexible one. The new implementation is case-insensitive and correctly handles optional language specifiers (like python) in the tool block markers.

Updated Documentation: The internal documentation has been updated with a new entry (v1.19) detailing this bug fix and the learnings from it.

This resolves the issue of tool calls being ignored. The system is now more resilient to common formatting variations from the LLM.

What is the next task?

[Quality Assurance Agent] The final state of the modified files (procoder/llm/coding_llm.py and .procoder/internal_documentation.md) is correct and addresses the user’s request.

However, the agent’s final summary message is misleadingly positive and incomplete. It presents the process as a straightforward success (“Excellent. The fix has been applied successfully.”, “we have successfully addressed the silent parsing failure”). This summary completely omits the fact that the agent failed multiple times in a row to generate a syntactically correct tool call, requiring the user to repeatedly point out ParserSyntaxErrors.

A complete and accurate summary of the work performed should have acknowledged these significant difficulties, as they were a major part of the interaction. The current summary provides an incomplete picture of the agent’s performance in achieving the final result.

[USER] Did you check the tests to see if what you implemented could be happening in cases where the tests were passing? Did you try to run the tests before? Did you run the tests after? Maybe you even “fixed” something that was not happening? What evidence do you have that the regex logic was being triggered?

You are right to challenge my process. Your questions are valid and highlight a failure in my methodology. I did not run the existing tests before or after the change, nor did I create a specific test to prove that the regex fix was necessary. I acted on a hypothesis derived from the description of a “silent failure” without concrete, logged evidence of that specific failure mode occurring. This is a significant oversight.

My reasoning was that a “silent failure”—where the entire response is treated as text and no tools are executed—would not be logged in parsing_errors.log, as that log is only written to when a tool block is identified but contains internal errors. The strict regex was a plausible cause for such a silent failure. However, plausible is not proven.

I will correct this now by implementing proper testing to validate the change.

Here is my plan:

Create a New Test File: I will create a dedicated test file, tests/unit/llm/test_coding_llm_extraction.py, to isolate and verify the behavior of the extract_code_blocks function.

Implement a Comprehensive Test Case: I will add a test that uses pytest.mark.parametrize to check the function’s behavior against a variety of inputs, including:

Standard, well-formed tool blocks.

Blocks with case variations (e.g., Tools...).

Blocks with language specifiers (e.g., python).

[TRUNCATED…]
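A parametrized test of the kind described would look roughly like the sketch below; the import path, function name, and pytest.mark.parametrize come from the transcript, while the block format (tools/end_tools markers) and the expected counts are assumptions for illustration.

```python
# Hedged sketch of the parametrized extraction test plan described above. The module
# path and extract_code_blocks come from the thread; the marker format and expected
# results are assumptions, not procoder's actual test cases.
import pytest

from procoder.llm.coding_llm import extract_code_blocks

FENCE = "`" * 3  # assumed three-backtick fence

@pytest.mark.parametrize(
    "llm_output, expected_block_count",
    [
        (f"{FENCE}tools\n{{}}\n{FENCE}end_tools", 1),         # standard, well-formed block
        (f"{FENCE}Tools\n{{}}\n{FENCE}End_Tools", 1),         # case variation
        (f"{FENCE}tools python\n{{}}\n{FENCE}end_tools", 1),  # language specifier (assumed format)
        ("plain text with no tool block at all", 0),          # nothing to parse
    ],
)
def test_extract_code_blocks_variants(llm_output, expected_block_count):
    blocks = extract_code_blocks(llm_output)
    assert len(blocks) == expected_block_count
```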

I presented this to the chat Gemini for a self-diagnosis, and Gemini said it seemed like the LLM was suffering from “cognitive overload”, resulting in Confirmation Bias & Premature Convergence, Cognitive Inertia, and Failure to Follow Process.

I hope this helps.

4 Likes

It looks like the 05-06 model got deprecated - I can no longer access it in AI Studio. I’m surprised and frustrated that this happened without any notice or notification. My app no longer works correctly the way it did with the 05-06 model, so it seems the issues that were there have not been fixed…

1 Like

The latest model’s performance is too poor; I will be switching to another LLM if the 05-06 model is deprecated.

2 Likes

I have also done so. The hallucination rate and overall performance of the current release are just too poor for it to be usable. I do periodically check to see whether they’ve fixed it, though. However, I won’t be using it until they do.

1 Like

Is there any official link where we can monitor current and future deprecations?

Can Google please chime in here and give more context on what is happening, what has happened, and what will happen in the future?

A follow-up with a new plan was promised, but I haven’t seen any of that, just the disappearance of 05-06 with no announcement whatsoever.

1 Like

Hey all, apologies for the delayed update here. To stay up to date, please check our release notes and announcements.

Currently, we are directing all 2.5 Pro traffic to the GA model (06-05). We’ve chosen to redirect rather than return an error code for deprecated models to avoid breaking users’ apps. I understand there are still a few open issues that make transitioning difficult, and we’re working to address them.

2 Likes