The 1M context window lie

Let’s be real, there is no 1M context.

Gemini 2.0 Flash falls apart after about 100,000 tokens of context. Every single time. I can reproduce it on every new task. It starts removing working code and making absolutely senseless edits.

The practical limit was around 200,000 a month ago, but it got much worse last week.

Not sure what you did, but 100k is far from 1 million.


I don’t think anyone from Google is going to try and refute your observation. I think the 1M context window is “creative embellishment”. There is a very specific benchmark (needle in a haystack) where the extended context window applies, and it measures pure retrieval.

What people want is not just retrieval, they want all of the context to be applicable during generation, so that attention can be directed to any portion of the context as needed. The Gemini models, as observed by their performance in practice, can only really direct attention within a 128k window, same as all the other models.

Edit: I think the recent Llama 4 release shows how this can be made to work. They released a large-context-window model, Scout, that I can only describe as “dumb”, and a not-so-large-context-window model, Maverick, that is way “smarter” than Scout. So the first is only good for finding stuff and telling the smart “sibling” model where to look when doing real work. At least, that’s how I interpreted how the pair are supposed to be used.


I agree that people from Google wouldn’t try to refute your personal experience, but based on my own experience I would disagree with @OrangiaNebula about the “creative embellishment”.

I am not with Google, but I have personally been using the large context window since 1.5 Pro. That version was pretty solid up to about 1.2M, when it would start to give nonsensical code, bringing up statements from weeks prior and trying to implement those code snippets. With 2.0 Flash, I was good again to about 1.1M. When 2.0 Pro came out, it was failing immediately, even at 500K. I think this is why they almost immediately came out with 2.5 Pro (thinking), which is the best model yet. But I still use 2.0 Flash: I got tired of switching back and forth because of rate limiting, and 2.0 Flash has continued to work well. I switch over to 2.5 Pro if I need a difficult piece of code, though usually at that point it is simply too complex for either of them. My current active “branch” is at 637k tokens.

Could you give an example of the kind of code that is failing immediately? And are you using the API or AI Studio? What options do you have active? I disabled grounding and code execution, and I had hit-or-miss success with structured output. I use function calling extensively on my .com website, though I have no idea how much of the context window that uses (measuring it is on my todo list).
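(For anyone wanting to check their own usage: a minimal sketch using the google-generativeai Python SDK. This assumes that SDK, and the prompt variable is just a stand-in for whatever you actually send.)

```python
import google.generativeai as genai

# Assumes an API key; replace the placeholder or load it from the environment.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Stand-in for the real payload (pasted codebase, function-calling history, etc.).
prompt = "..."

# count_tokens reports how many tokens the request would consume,
# which tells you how close you are to the advertised context limit.
print(model.count_tokens(prompt).total_tokens)
```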

I don’t know what the OP has been having trouble with. In my case, it’s Python with OpenMDAO and Dymos, trying to model the lunar lander problem. The FlightPath2D component depends on the propulsive forces component (Isp and throttle), the ThrustComp (which computes mass flow rate for the vehicle engines at full throttle), and the flight_dynamics subsystem (gravitational attraction on whatever the current lunar lander mass is). Gemini can set the problem up fine when it’s a monolith, and reliably fails when trying to decompose the problem into subsystems, which is how OpenMDAO is supposed to be used. It’s when the required connectivity code gets added (to communicate between subsystems) that a complexity threshold is exceeded and it starts forgetting parts or simply generating bad code.
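To make the decomposition concrete, here is a stripped-down sketch of the subsystem wiring pattern. The component internals and numbers are placeholders rather than my actual model; the point is the connect step between subsystems.

```python
import openmdao.api as om

# Placeholder ThrustComp: mass flow rate at full throttle from a simple
# thrust / (g0 * Isp) relation. The real component is richer than this.
class ThrustComp(om.ExplicitComponent):
    def setup(self):
        self.add_input('thrust', units='N')
        self.add_input('Isp', units='s')
        self.add_output('mdot', units='kg/s')

    def compute(self, inputs, outputs):
        g0 = 9.80665  # standard gravity, m/s**2
        outputs['mdot'] = inputs['thrust'] / (g0 * inputs['Isp'])

prob = om.Problem()
model = prob.model

# Each piece of the lander model becomes its own subsystem...
model.add_subsystem('thrust_comp', ThrustComp())
# Stand-in for the flight_dynamics subsystem (an ExecComp so the example runs;
# the real one handles gravitational attraction of the current lander mass).
model.add_subsystem('flight_dynamics',
                    om.ExecComp('mass_rate = -mdot',
                                mdot={'units': 'kg/s'},
                                mass_rate={'units': 'kg/s'}))

# ...and the connectivity code wires the subsystems together. This is the
# step where Gemini starts forgetting connections or inventing variable names.
model.connect('thrust_comp.mdot', 'flight_dynamics.mdot')

prob.setup()
prob.set_val('thrust_comp.thrust', 45000.0)  # placeholder values
prob.set_val('thrust_comp.Isp', 311.0)
prob.run_model()
print(prob.get_val('flight_dynamics.mass_rate'))
```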

As the saying goes, your mileage may vary. Highly repetitive code (a website with widgets) works with more tokens, presumably because each part has low complexity. My use case involves systems of partial differential equations and causes the model to stall out sooner.


That sounds like a fun problem and a neat project. Is it drone/IoT related, or for larger commercial vehicles? Do you have a link to anything public? I don’t want to hijack the topic, but I’ve been trying to find a “simple” fluid dynamics simulator for novel drone configurations for an idea I have. I just don’t have the time to devote full resources to it right now.

Anyway, my code is TypeScript using only my custom libs, with helper-gib, ts-gib, and core-gib being the main ones. At first I was pasting in the entire codebases, which would get me to about 700K tokens. You can check out the code; these are not websites or widgets. It is a unique approach to DLT that looks more like a generalized git than anything else. The model’s ability to grasp the nuances was what got me. But I had to learn through trial and error how to decompose problems and how to get the model used to my coding style.

Is this an issue with the context window or with the complexity of the domain space? I have found that if I try to get the model to do too much at once (that complexity threshold you mention), or if I try to override what the model really wants to do, that is when it gets “frustrated” and can start acting erratically. The 2.0 Pro case for me was straightforward: I would switch to that model, give it the exact same prompt I would give 2.0 Flash, and it would spit out gibberish. I tested this in multiple different chats at multiple different context sizes and gave feedback here on this forum.

So like you say, mileage may definitely vary (a little logical fun there). But for me it was genuinely surprising how well the large context window worked, which is why I responded here. It has also been a learning process in understanding the capabilities of the models, which is what I was hoping to convey to @Jowo and whoever else might benefit. I have definitely been frustrated at times, but overall working through this has been worthwhile. Great product.

It’s unfortunate that this happened to you. Sometimes I’ve had it happen to me too, but not always; there have been several times I filled it right up to 1 million. I suppose if you are lucky and get on at a time when there is less load, you can rack it up to a million. With complex projects, by the time you crack 700 to 800,000 tokens of examples and collaboration with the model, it really comes to life and works well. At least, it has for me several times.
