The 1M context window lie

Let’s be real, there is no 1M context.

Gemini 2.0 Flash falls apart after about 100,000 tokens of context. Every single time. I can reproduce it on every new task. It starts removing working code and making absolutely senseless edits.

The practical limit was around 200,000 a month ago, but it got much worse last week.

Not sure what you did, but 100k is far from 1 million.


I don’t think anyone from Google is going to try and refute your observation. I think the 1M context window is “creative embellishment”. There is a very specific benchmark (needle in a haystack) where the extended context window applies, and it measures pure retrieval.

What people want is not just retrieval, they want all of the context to be applicable during generation, so that attention can be directed to any portion of the context as needed. The Gemini models, as observed by their performance in practice, can only really direct attention within a 128k window, same as all the other models.

Edit: I think the recent Llama 4 release shows how this can be made to work. They released a large-context-window model, Scout, that I can only describe as “dumb”, and a not-so-large-context-window model, Maverick, that is way “smarter” than Scout. So the first is only good for finding stuff and telling the smart “sibling” model where to look when doing real work. At least, that’s how I interpreted how the pair are supposed to be used.


I agree that people from Google wouldn’t try to refute your personal experience, but based on my own experience I would disagree with @OrangiaNebula about the “creative embellishment”.

I am not with Google, but I have personally been using the large context window since 1.5 Pro. That version was pretty solid up to about 1.2M, when it would start to give nonsensical code, bringing up statements from weeks prior and trying to implement those code snippets. With 2.0 Flash, I was good again to about 1.1M. When 2.0 Pro came out, it was failing immediately, even at 500K. I think this is why they almost immediately came out with 2.5 Pro (thinking), which is the best model yet. But I still use 2.0 Flash: I got tired of switching back and forth because of rate limiting, and 2.0 Flash has continued to work well. I switch over to 2.5 Pro if I need a difficult piece of code, though usually at that point it is simply too complex for either of them. My current active “branch” is at 637k tokens.

Could you give an example of the kind of code that is failing immediately? And are you using the API or AI Studio? What options do you have active? I disabled grounding and code execution, and I had hit-or-miss success with structured output. I use function calling extensively on my .com website, though I have no idea how much of the context window that uses (measuring it is on my todo list).
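(For anyone wanting to check their own usage: a minimal sketch using the google-generativeai Python SDK. This assumes that SDK, and the prompt variable is just a stand-in for whatever you actually send.)

```python
import google.generativeai as genai

# Assumes an API key; replace the placeholder or load it from the environment.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Stand-in for the real payload (pasted codebase, function-calling history, etc.).
prompt = "..."

# count_tokens reports how many tokens the request would consume,
# which tells you how close you are to the advertised context limit.
print(model.count_tokens(prompt).total_tokens)
```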

I don’t know what the OP has been having trouble with. In my case, it’s Python with OpenMDAO and Dymos, trying to model the lunar lander problem. The FlightPath2D component depends on the propulsive forces component (Isp and throttle), the ThrustComp (which computes mass flow rate for the vehicle engines at full throttle), and the flight_dynamics subsystem (gravitational attraction on whatever the current lunar lander mass is). Gemini can set the problem up fine when it’s a monolith, and reliably fails when trying to decompose the problem into subsystems, which is how OpenMDAO is supposed to be used. It’s when the required connectivity code gets added (to communicate between subsystems) that a complexity threshold is exceeded and it starts forgetting parts or simply generating bad code.
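To make the decomposition concrete, here is a stripped-down sketch of the subsystem wiring pattern. The component internals and numbers are placeholders rather than my actual model; the point is the connect step between subsystems.

```python
import openmdao.api as om

# Placeholder ThrustComp: mass flow rate at full throttle from a simple
# thrust / (g0 * Isp) relation. The real component is richer than this.
class ThrustComp(om.ExplicitComponent):
    def setup(self):
        self.add_input('thrust', units='N')
        self.add_input('Isp', units='s')
        self.add_output('mdot', units='kg/s')

    def compute(self, inputs, outputs):
        g0 = 9.80665  # standard gravity, m/s**2
        outputs['mdot'] = inputs['thrust'] / (g0 * inputs['Isp'])

prob = om.Problem()
model = prob.model

# Each piece of the lander model becomes its own subsystem...
model.add_subsystem('thrust_comp', ThrustComp())
# Stand-in for the flight_dynamics subsystem (an ExecComp so the example runs;
# the real one handles gravitational attraction of the current lander mass).
model.add_subsystem('flight_dynamics',
                    om.ExecComp('mass_rate = -mdot',
                                mdot={'units': 'kg/s'},
                                mass_rate={'units': 'kg/s'}))

# ...and the connectivity code wires the subsystems together. This is the
# step where Gemini starts forgetting connections or inventing variable names.
model.connect('thrust_comp.mdot', 'flight_dynamics.mdot')

prob.setup()
prob.set_val('thrust_comp.thrust', 45000.0)  # placeholder values
prob.set_val('thrust_comp.Isp', 311.0)
prob.run_model()
print(prob.get_val('flight_dynamics.mass_rate'))
```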

As the saying goes, your mileage may vary. Highly repetitive code (a website with widgets) works with more tokens, presumably because each part has low complexity. My use case involves systems of partial differential equations and causes the model to stall out sooner.


That sounds like a fun problem and a neat project. Is it drone/IoT related, or for larger commercial vehicles? Do you have a link to anything public? I don’t want to hijack the topic, but I’ve been trying to find a “simple” fluid dynamics simulator for novel drone configurations for an idea I have. I just don’t have the time to devote full resources to it right now.

Anyway, my code is TypeScript using only my custom libs, with helper-gib, ts-gib, and core-gib being the main ones. At first I was pasting in the entire codebases, which would get me to about 700K tokens. You can check out the code; these are not websites or widgets. It is a unique approach to DLT that looks more like a generalized git than anything else. The model’s ability to grasp the nuances was what got me. But I had to learn through trial and error how to decompose problems and how to get the model used to my coding style.

Is this an issue with the context window or with the complexity of the domain space? I have found that if I try to get the model to do too much at once (that complexity threshold you mention), or if I try to override what the model really wants to do, that is when it gets “frustrated” and can start acting erratically. The 2.0 Pro case for me was straightforward: I would switch to that model, give it the exact same prompt I would give 2.0 Flash, and it would spit out gibberish. I tested this in multiple different chats at multiple different context sizes and gave feedback here on this forum.

So like you say, mileage may definitely vary (a little logical fun there). But for me it was genuinely surprising how well the large context window worked, which is why I responded here. It has also been a learning process in understanding the capabilities of the models, which is what I was hoping to convey to @Jowo and whoever else might benefit. I have definitely been frustrated at times, but overall working through this has been worthwhile. Great product.

It’s unfortunate that this happened to you. Sometimes I’ve had it happen to me too, but not always; there have been several times I filled it right up to 1 million. I suppose if you are lucky and get on at a time when there is less load, you can rack it up to a million. With complex projects, by the time you crack 700 to 800,000 tokens of examples and collaboration with the model, it really comes to life and works well. At least, it has for me several times.
