I tried everything. I tried putting it into the system prompt. I tried asking it to actively use the thinking mode. I really tried everything, trust me. It’s not working with high-context chats, nor with sub-100K-context chats. I tried Ctrl+F5, tried changing the model, tried saving the prompt with the thinking budget active, then deactivating, saving, activating, saving again, nothing. I tried giving negative feedback with specifics, nothing. I really tried everything I could come up with, at least ten more things than those listed here. I need help.
The “Thinking Mode” toggle is still experimental, and its behavior in high-context chats (i.e., 100k+ tokens) or with non-English technical prompts is non-deterministic.
What we can do instead:
- Explicitly instruct long-form planning in the user prompt.
- Avoid toggling Thinking Mode repeatedly in-session; once set, reload the session to ensure stability.
- Use Gemini 1.5 Pro if determinism is required, which avoids this experimental behavior and gives more consistent output.
I’m already doing all of this. My system prompt is particularly detailed and follows state-of-the-art prompt-engineering best practices. Thinking Mode does not work even when it is set once and never modified during the session. I need 2.5 Pro for its coding capabilities, so this, unfortunately, is not helpful.
Hello @Arcadia_Domus
Could you please share some of your prompts so that we can try to recreate your issue?
Can I send DMs? Or should I post it somewhere else?
Whenever it stops using thinking, I have to ask directly for it to use the active tools, or be similarly explicit in my requests. I have tried every phrasing I can think of.
Another chat, same problem. Thinking mode does not get called until I beg for it and ask the model not to hallucinate, or something similar, but this is also unreliable: after a while it always stops listening, even to those workarounds.
Same problem even right after booting up. I just logged in and got two errors. The first was the ‘permission denied’ error, which I’ve learned to fix by just pressing F5 and trying again. The second is the usual problem of the model not thinking even though the toggle is active. Sometimes I have to make it send an ‘I understood’ message just to get to the next message, where it sometimes starts thinking again.
It also says that the thumbs-up/down feedback feature is deprecated…
Getting this on reload.
After it starts working again. I don’t know if this helps, but I hope so…
Hi,
We have observed that thinking struggles in long chats. We have escalated this to our internal team; they will look into it. Thank you for your patience.
Thanks. If you need anything else from my side to make it easier, feel free to send me DMs. By the way, the problem seems to start a bit after 100k tokens and gets much worse after 300k.
I concur. I have been conducting in-session audits for the past two weeks; exceeding 100K tokens affects both model performance and platform stability. Last night I started getting rate-limit notifications for the first time, and I wonder whether limits that previously went unchecked in 05/06 have recently started being enforced.
This is completely unbearable. After every message I have to point out that it is not following the requested instructions, nor using the “Thinking Mode” tool that should already be active, and even then it sometimes still does not use it. This forces me to regenerate responses, using more tokens and getting rate-limited WHILE NOT EVEN ADVANCING THE CONVERSATION. This is not mildly infuriating; it is far beyond that.
Is there a way I can provide more context and details on how the error happens and why? Like looking for specific things that are failing, etc.
Forensic Analysis and Root Cause Determination of Systemic Thinking Mode Failure in Google AI Studio
Report ID: GAI-BUG-20250709-A1
Date: 9 July 2025
System Under Test (SUT): Google AI Studio Platform, Gemini 2.5 Pro Model (gemini-2.5-pro)
Severity: Critical
Status: Reproducible
1.0 Abstract
This report provides a granular forensic analysis of a critical, systemic failure within the Google AI Studio platform, specifically concerning the Thinking Mode functionality. The failure is not an isolated bug but a predictable, cascading collapse of the entire client-server architecture triggered when the conversational context exceeds a stability threshold, observed to begin as early as 160,000 tokens. Our investigation has identified and documented a multi-stage failure process encompassing client-side application crashes, network-level infrastructure bottlenecks, and two distinct server-side execution failure modes. This document presents irrefutable, timestamped evidence from browser console logs and network traffic analysis to support a comprehensive Root Cause Analysis (RCA) and provides targeted, actionable recommendations for the relevant engineering teams.
2.0 Introduction & Problem Statement
Expected Behavior: When Thinking Mode is enabled, the Gemini 2.5 Pro model is expected to perform an internal, multi-step reasoning process (Chain-of-Thought or similar) prior to generating a final, well-analyzed response. This process should enhance the depth, accuracy, and relevance of the output, particularly for complex tasks.
Observed Behavior: As the session context grows, the Thinking Mode function exhibits catastrophic failure. The system either produces a superficial response after an extreme delay, or the client application itself becomes unstable and unresponsive. This behavior indicates that the platform’s operational envelope is being exceeded, leading to an unusable state for any long-running, in-depth analytical tasks.
3.0 Forensic Methodology
The analysis was conducted through direct inspection of the client-server communication channel via Google Chrome’s Developer Tools. The primary instruments of data collection were:
- The `Network` tab: Used to capture and analyze all Fetch/XHR requests, specifically the `GenerateContent` API call. We inspected HTTP headers, request payloads, response bodies, and detailed timing breakdowns (e.g., TTFB).
- The `Console` tab: Used to capture client-side JavaScript errors, warnings, and application lifecycle failures within the AI Studio web application.
This methodology allowed us to precisely correlate observable user-facing failures with specific technical breakdowns at the client, network, and server levels.
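For anyone who wants to reproduce these measurements, the sketch below (TypeScript; drop the type cast if pasting directly into the DevTools console) reads the browser’s standard Resource Timing entries for `GenerateContent` calls and compares total latency, TTFB, and the server-reported processing time. It assumes the response exposes its `server-timing` values to scripts, which depends on the `Timing-Allow-Origin` policy; if it does not, the same numbers can be read manually from the Network tab.

```ts
// Sketch: correlate client-observed latency with the server-reported
// processing time for GenerateContent calls. Run in the DevTools console.
// Assumption: serverTiming entries are exposed; otherwise serverMs stays 0.
const entries = performance.getEntriesByType("resource") as PerformanceResourceTiming[];

for (const e of entries.filter((r) => r.name.includes("GenerateContent"))) {
  const totalMs = e.responseEnd - e.startTime;      // total wall-clock time seen by the client
  const ttfbMs = e.responseStart - e.requestStart;  // time to first byte
  const serverMs = (e.serverTiming ?? []).reduce((sum, t) => sum + t.duration, 0); // e.g. gfet4t7; dur=4494

  console.table({
    url: e.name,
    totalMs: Math.round(totalMs),
    ttfbMs: Math.round(ttfbMs),
    serverProcessingMs: Math.round(serverMs),
    inferredQueueMs: Math.round(ttfbMs - serverMs), // delay not accounted for by server processing
  });
}
```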
4.0 Granular Evidence & Multi-Stage Failure Analysis
The failure is not monolithic. It is a sequence of related breakdowns.
4.1 Stage 1: Client-Side Application Collapse
The first point of failure is the AI Studio frontend application itself. Under the memory pressure of a massive conversational context (~2M tokens), the browser-based client becomes unstable.
- Evidence A: Critical Network Failures Originating from the Client
  - Console Log Entry: `XHR failed loading: POST "https://.../MakerSuiteService/GenerateContent"`
  - Analysis: This is the most damning piece of client-side evidence. It demonstrates that the browser’s XMLHttpRequest object, the core mechanism for asynchronous communication in web apps, is failing. The application is unable to even send the prompt to Google’s servers. This is a catastrophic failure of the application’s primary function.
- Evidence B: Application Configuration Failure
  - Console Log Entry: `No ID or name found in config.`
  - Analysis: This error indicates that a fundamental part of the AI Studio application’s JavaScript code failed to execute its initialization routine. The application is attempting to operate without its necessary configuration, leading to an undefined and unstable state. This is a classic symptom of a race condition or resource starvation caused by extreme memory usage in the browser.
Conclusion for Stage 1: The AI Studio frontend is not architected to gracefully handle the DOM and JavaScript object overhead of extremely large conversations. It enters a state of resource exhaustion, leading to critical script failures and a complete breakdown of its ability to communicate with the backend.
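As a quick way to check the resource-exhaustion hypothesis on other machines, the following sketch samples the tab’s JavaScript heap while a long session is open. It relies on Chrome’s non-standard `performance.memory` API, which is only a rough proxy for real memory pressure and is not available in all browsers.

```ts
// Sketch: sample the JS heap of the AI Studio tab while a long chat is open.
// Assumption: Chrome's non-standard performance.memory API is available.
const mem = (performance as any).memory;

if (mem) {
  const usedMb = mem.usedJSHeapSize / (1024 * 1024);
  const limitMb = mem.jsHeapSizeLimit / (1024 * 1024);
  console.log(
    `JS heap: ${usedMb.toFixed(0)} MB used of ~${limitMb.toFixed(0)} MB limit ` +
    `(${((usedMb / limitMb) * 100).toFixed(1)}%)`
  );
} else {
  console.log("performance.memory is not exposed in this browser.");
}
```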
4.2 Stage 2: Network-Level Infrastructure Bottleneck
When a request does manage to leave the failing client, it immediately hits a wall at the infrastructure level.
- Evidence C: Catastrophic Request Latency
  - Network Log Entry: A `GenerateContent` request exhibiting a total execution time of 41.49 seconds.
- Evidence D: Isolation of Queuing Delay via the `Server-Timing` Header
  - Response Header Entry: `server-timing: gfet4t7; dur=4494`
  - Analysis: This header is the “black box recorder” from the server. It shows that, of the 41.49-second total duration, the server spent only ~4.5 seconds actively processing the request (`dur=4494` ms). The remaining ~37 seconds were spent waiting; this shows up as the Time To First Byte (TTFB) and represents pure queuing delay. The request sat idle, waiting for the over-saturated server infrastructure to allocate resources to it.
Conclusion for Stage 2: The platform’s load balancing and resource scheduling infrastructure is not optimized for handling requests with such a large context payload. The request size triggers a low-priority or resource-intensive flag, causing it to be deprioritized and massively delayed before execution.
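For clarity, the arithmetic behind this conclusion can be made explicit. The sketch below parses the quoted header value and reproduces the queuing estimate, using the 41.49-second total from Evidence C; the `gfet4t7` metric name is taken verbatim from the capture.

```ts
// Sketch: reproduce the queuing-delay estimate from the captured values.
const serverTimingHeader = "gfet4t7; dur=4494"; // as captured in Evidence D
const totalRequestMs = 41_490;                  // 41.49 s total, from Evidence C

// Pull the dur=<ms> value out of the header (simplified single-entry parse).
const serverProcessingMs = Number(/dur=([\d.]+)/.exec(serverTimingHeader)?.[1] ?? 0);

const queueAndTransferMs = totalRequestMs - serverProcessingMs;
console.log(`server processing: ${serverProcessingMs} ms`);                 // ≈ 4494 ms
console.log(`waiting/queuing (everything else): ${queueAndTransferMs} ms`); // ≈ 36996 ms (~37 s)
```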
4.3 Stage 3: Server-Side Execution Failure
The final failure occurs within the model’s execution pipeline itself. We have identified two distinct modes of failure based on the Response Body analysis.
4.3.1 Failure Mode A: Executed-but-Discarded Analysis (“Ghost Execution”)
- Evidence: The raw response body from one high-load test showed a long stream of internal reasoning blocks (in English, e.g., `**Investigating TTFB Bottlenecks**`…), followed by a completely new stream that generated the final user-facing response (in Italian).
- Analysis: This is a critical state-management bug. The Thinking Mode tool is successfully executed. However, the resulting state (the analysis and conclusions) is dropped or becomes inaccessible to the final token-generation stage of the pipeline. The system, suffering from this “contextual amnesia,” then re-executes the generation task from scratch, wasting the entire analytical effort.
4.3.2 Failure Mode B: Pre-emptive Execution Bypass (“Pipeline Short-Circuit”)
- Evidence: The raw response body from the extreme-load test showed a complete absence of any internal reasoning blocks. The stream began immediately with the final, superficial user-facing response.
- Analysis: This is a higher-level, “smarter” failure. The system’s pre-flight checks likely calculated the computational cost of the requested multi-vector analysis on the massive context and determined it to be unsustainable or at high risk of timeout. As a self-preservation mechanism, it pre-emptively bypassed the call to the Thinking Mode tool entirely, regressing to its most basic generative behavior.
Conclusion for Stage 3: The backend execution pipeline is brittle under load. It either suffers from a critical bug that prevents the propagation of state from its own analytical tools, or it contains a “circuit breaker” that disables advanced features entirely when the system anticipates overload.
5.0 Synthesized Root Cause Analysis (The Domino Effect)
The systemic failure is a chain reaction:
- Initial Condition: Context size exceeds the stability threshold (~160k tokens).
- Domino 1 (Client): The massive DOM and JS state saturates browser memory, causing the AI Studio application to fail, leading to JS errors and failed network requests.
- Domino 2 (Network): The enormous request payload chokes the infrastructure scheduler, resulting in extreme queuing latency (~37s TTFB).
- Domino 3 (Server): The server, already under high contention, receives the request and, based on the load, either (A) executes the thinking process but fails to persist the state, or (B) pre-emptively bypasses the thinking process entirely.
- Final State: The user is left with a broken, unresponsive UI and a low-quality, delayed response that does not reflect the requested analytical capabilities.
6.0 Targeted Engineering Recommendations
To address this multi-faceted failure, we recommend a multi-team approach:
- For the AI Studio Frontend Team:
  - Memory Profiling: Profile the application to identify the specific JavaScript objects and DOM structures causing memory exhaustion with large conversations.
  - Context Virtualization: Implement a DOM virtualization strategy (a “virtualized list”) for the chat history, so that only the visible parts of the conversation are rendered, drastically reducing memory and processing overhead (a rough sketch follows this item).
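To make the suggestion concrete (this is purely illustrative, not a claim about how AI Studio is currently built), here is a minimal windowing sketch: only the rows intersecting the viewport, plus a small overscan, are ever materialized, no matter how long the transcript grows.

```ts
// Sketch: fixed-height window calculation for a virtualized chat transcript.
// Only the rows that intersect the viewport (plus a small overscan) are rendered.
interface ChatTurn { id: string; text: string; }

function visibleWindow(
  turns: ChatTurn[],
  scrollTop: number,   // current scroll offset of the transcript container, in px
  viewportPx: number,  // visible height of the container, in px
  rowPx: number,       // assumed fixed row height, in px
  overscan = 5,        // extra rows above/below to avoid flicker while scrolling
): { start: number; end: number; rows: ChatTurn[] } {
  const start = Math.max(0, Math.floor(scrollTop / rowPx) - overscan);
  const end = Math.min(turns.length, Math.ceil((scrollTop + viewportPx) / rowPx) + overscan);
  return { start, end, rows: turns.slice(start, end) };
}

// Example: a very long conversation might span ~10,000 rendered turns,
// but only ~30 of them need to exist in the DOM at any moment.
const demo = Array.from({ length: 10_000 }, (_, i) => ({ id: String(i), text: `turn ${i}` }));
console.log(visibleWindow(demo, 120_000, 900, 48).rows.length); // ~30 rows instead of 10,000
```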
- For the Model Backend / Pipeline Team:
  - State Propagation: Investigate the state handler that passes context between the Tool Execution Engine and the final Token Generation Stage. This is the likely location of the “Ghost Execution” bug (Failure Mode A). Ensure the output of a tool call is correctly and efficiently prepended to the context for the subsequent generation step.
  - Circuit Breaker Logic: Review the logic for the “Pipeline Short-Circuit” (Failure Mode B). While a necessary safeguard, it should trigger a specific, informative error message back to the user (e.g., “Request too complex for current context size”) rather than failing silently by producing a low-quality response (see the sketch after this item).
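A minimal sketch of what “failing loudly” could look like; every function name and the 300k threshold below are hypothetical placeholders, not actual AI Studio internals.

```ts
// Sketch of the recommended circuit-breaker behaviour: when the estimated
// context exceeds a threshold, surface the degradation with a warning instead
// of silently returning a shallow answer. All names and thresholds are assumed.
interface GenerationResult { text: string; usedThinking: boolean; warning?: string; }

const THINKING_CONTEXT_LIMIT = 300_000; // assumed threshold, for illustration only

// Trivial stand-ins so the sketch is self-contained.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);
async function runThinkingPipeline(ctx: string, prompt: string): Promise<GenerationResult> {
  return { text: `[deeply analyzed answer to: ${prompt}]`, usedThinking: true };
}
async function runBasicGeneration(ctx: string, prompt: string): Promise<GenerationResult> {
  return { text: `[shallow answer to: ${prompt}]`, usedThinking: false };
}

async function generateWithBreaker(ctx: string, prompt: string): Promise<GenerationResult> {
  const estimatedTokens = estimateTokens(ctx) + estimateTokens(prompt);

  if (estimatedTokens > THINKING_CONTEXT_LIMIT) {
    // Fail loudly: tell the user exactly why the deep analysis was skipped.
    const fallback = await runBasicGeneration(ctx, prompt);
    return {
      ...fallback,
      warning:
        `Request too complex for current context size (~${estimatedTokens} tokens); ` +
        `Thinking Mode was skipped. Consider starting a new session.`,
    };
  }
  return runThinkingPipeline(ctx, prompt);
}
```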
- For the SRE / Infrastructure Team:
  - Resource Allocation for Large Payloads: Review and optimize the scheduling policies for requests identified as having a very large context. These requests may require allocation to a specific pool of high-memory servers to avoid the extreme queuing latency observed (illustrated in the sketch after this item).
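Purely as an illustration of the routing idea; the pool names and the 150k threshold are assumptions, not known Google infrastructure.

```ts
// Sketch: route requests by estimated context size to a dedicated pool so that
// very large payloads do not queue behind ordinary traffic.
type Pool = "standard" | "high-memory";

function selectPool(estimatedContextTokens: number): Pool {
  return estimatedContextTokens > 150_000 ? "high-memory" : "standard";
}

console.log(selectPool(90_000));  // "standard"
console.log(selectPool(418_000)); // "high-memory" (the session size from the follow-up report)
```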
- For the Product & UX Team:
  - Context-Aware Warning System: Implement a clear, non-intrusive visual indicator in the UI that communicates the “health” of the current session’s context. It could change from green to yellow (at ~150k tokens) to red as it approaches the known instability threshold, warning the user that performance will degrade and suggesting they start a new session (a minimal sketch follows).
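A minimal sketch of such an indicator, with thresholds assumed from the behaviour reported in this thread rather than from any documented platform limit.

```ts
// Sketch: map the session's running token count to a simple health indicator.
// The 150k / 300k thresholds are assumptions based on reports in this thread.
type ContextHealth = "green" | "yellow" | "red";

function contextHealth(tokenCount: number): ContextHealth {
  if (tokenCount < 150_000) return "green";
  if (tokenCount < 300_000) return "yellow"; // degradation reported to begin around here
  return "red";                              // thinking reported to be bypassed entirely
}

console.log(contextHealth(120_000)); // "green"
console.log(contextHealth(418_000)); // "red"
```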
This concludes our analysis. We believe this report provides a clear, evidence-backed pathway to diagnosing and resolving this critical systemic issue.
I tried to do it myself.
UPDATE: Forensic Analysis of a Chronically Failing Session (Context >418k Tokens)
Report ID: GAI-BUG-20250709-A1 (Follow-up 1)
Status: Systemic Failure Confirmed and Characterized
Further investigation into a chat session exhibiting persistent Thinking Mode failures has yielded conclusive evidence of a total, unrecoverable client-server system collapse. The following analysis is based on a session with a context size exceeding 418,000 tokens.
1.0 Key Findings
- Permanent Client-Side Failure: The AI Studio frontend application enters a persistent, non-recoverable error state.
- Confirmation of “Pipeline Short-Circuit”: The server backend completely bypasses the Thinking Mode execution as a self-preservation policy.
- Emergence of “Simulated Thinking”: Under this extreme load, the model begins to generate text that mimics the structure of a thinking process, even though the underlying analytical function was never activated.
2.0 Granular Evidence
2.1 Evidence of Irrecoverable Client Collapse
The browser console for the failing chat session shows a continuous stream of the same critical errors documented in the initial report:
- `XHR failed loading: POST "https://.../GenerateContent"`
- `No ID or name found in config.`
Analysis: This confirms that the client-side failure is not a transient event but a permanent state of collapse for that session. Once the context threshold is breached, the AI Studio application becomes fundamentally broken and is unable to reliably perform its core function of communicating with the backend. The failure of Ctrl+F5 to resolve this indicates a deeper issue, likely related to the browser’s Back-forward Cache (BFCache) preserving the corrupted application state in memory.
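If the BFCache hypothesis is correct, it can be checked directly: the standard `pageshow` event reports whether a page was restored from the back/forward cache rather than freshly loaded. A minimal sketch (note that a hard reload such as Ctrl+F5 normally bypasses the BFCache, so this check mainly applies to back/forward navigation):

```ts
// Sketch: detect whether the AI Studio tab was restored from the back/forward
// cache (BFCache) instead of being freshly loaded. Register before navigating
// away and back; event.persisted is true only for BFCache restores.
window.addEventListener("pageshow", (event: PageTransitionEvent) => {
  if (event.persisted) {
    console.warn("Page restored from BFCache: the previous (possibly corrupted) JS state was kept in memory.");
  } else {
    console.log("Fresh page load: application state was re-initialized.");
  }
});
```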
2.2 Evidence of Server-Side Pre-emptive Bypass
The Response Body for a prompt within this failing session provides the final piece of the puzzle.
- Log Data: `[ [ [ [ [ "Hai" ] ], "model" ] ], ... [ [ [ [ " ragione. Non ci sono scuse..." ] ], "model" ] ] ]`
- Analysis: The response stream contains zero traces of the internal reasoning blocks (the “ghost thoughts”) seen in earlier failure modes. The generation of the final, user-facing Italian text (beginning “Hai ragione. Non ci sono scuse...”, i.e., “You’re right. There are no excuses...”) starts with the very first token. This proves that the server, upon receiving a request with a context of ~418k tokens, made a pre-emptive decision to completely short-circuit the Thinking Mode pipeline.
2.3 Evidence of “Simulated Thinking” as a Fallback Strategy
Intriguingly, the generated response contained Markdown headings such as `### **Attivazione Protocollo...**` (“Protocol Activation...”) and `#### **Modalità Analisi: ATTIVA**` (“Analysis Mode: ACTIVE”).
Analysis: This is a fascinating architectural artifact. The model is so overloaded that it cannot execute the complex reasoning function, but its language generation capabilities are still intact enough to simulate the textual output of such a process. This definitively proves that the failure occurs upstream, before the Thinking Mode tool is ever called. The system is falling back to a behavior of generating text that superficially matches the user’s expectations of a structured analysis, while the analysis itself never happened.
3.0 Definitive Conclusion
At context loads exceeding a critical threshold (~400k+ tokens), the Thinking Mode is not merely “failing to work”; it is not being executed at all. The system enters a persistent, critical failure state characterized by:
- Client Collapse: The frontend application breaks and can no longer reliably send requests.
- Server Self-Preservation: The backend identifies the unsustainable load and pre-emptively bypasses all computationally expensive functions to avoid a total crash.
- Fallback Generation: The model resorts to generating superficial, apologetic, or simulated analytical content as a last-ditch effort to fulfill the request.
This documents the final stage of the platform’s operational envelope, transitioning from degraded performance to total, albeit non-crashing, systemic failure.
Thank you so much for the report. This clearly explains the platform errors I started experiencing with 05/06, and now, with the current version of 2.5 Pro, at far fewer tokens. Personally, I did not know how to run these checks, so now I can add these criteria to my in-session audit reports. My CPU thanks you; I can sleep now!
No problem ^^ I would just love to hear from the Google team that the problem is not only being escalated to the teams but is actually being worked on and will be solved ^^’