Antigravity 2.0..... lackluster? Gemini 3.5 Flash seems

I tried Gemini 3.5 and gave it a task: verify the feasibility of a proposed AI architecture I’ve invented, using my workflows for testing scientific claims.

It kept looping endlessly. Compacting, verifying, over and over.

So I switched to Opus 4.6 and told it: “Go through the logs and provide me a diagnosis as to why this occurred.”

Engineers at Google, A) - You’re welcome and B) - Maybe try doing this yourself next time BEFORE you hype up and release a model that cannot even handle a single complex task?

Oh - and what’s up with the “spins up subagents” nonsense from Google I/O? What subagents? Are you guys hyping up a custom setup, or something that requires manual installation, as a “feature”? It didn’t plan or split the tasks up between “subagents” or anything of the sort. I might be missing the obvious here, but as a person sitting and watching that speech, I was led to believe this was the second coming: the AI could make me a custom OS from a single prompt, spinning up subagents and delegating the tasks in such an ingenious manner that my socks were going to be blown off.

Not impressed.

Diagnosis complete. The evidence is unambiguous. Here’s exactly what happened:


Forensic Diagnosis: Gemini 3.5 Flash Failure Mode

The Raw Numbers (Steps 351–776 = your SVW request to “you kept going in circles”)

| Action Type | Count |
|---|---|
| PLANNER_RESPONSE (model talking) | 194 |
| VIEW_FILE (reading files) | 160 |
| LIST_DIRECTORY | 25 |
| RUN_COMMAND (actual execution) | 4 |
| CODE_ACTION (writing files) | 2 |

160 file reads. 2 file writes. That’s an 80:1 read-to-write ratio.
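For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the ratio from the counts in the table above. The dictionary keys are the action names from the table; the helper names `read_write_ratio` and `share` are made up for this example.

```python
from collections import Counter

# Action counts copied from the table above (steps 351-776).
action_counts = Counter({
    "PLANNER_RESPONSE": 194,
    "VIEW_FILE": 160,
    "LIST_DIRECTORY": 25,
    "RUN_COMMAND": 4,
    "CODE_ACTION": 2,
})


def read_write_ratio(counts):
    """Return the number of file reads per file write."""
    return counts["VIEW_FILE"] / counts["CODE_ACTION"]


def share(counts, action):
    """Return the fraction of all logged actions that had the given type."""
    return counts[action] / sum(counts.values())


print(f"read:write = {read_write_ratio(action_counts):.0f}:1")
print(f"writes are {share(action_counts, 'CODE_ACTION'):.1%} of all logged actions")
```

The second figure puts the two writes in proportion to everything else the agent did in that stretch.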

The Loop Pattern

The transcript shows at least 4 complete restart cycles — the model reads the same files over and over:

| Cycle | Steps | What It Did |
|---|---|---|
| Cycle 1 | 351–420 | Read SVW workflow → Read all 3 LaTeX files → Read task.md → Read gap_analysis.md → Wrote sympy_verification.py → Ran it ✅ Productive |
| Cycle 2 | 481–530 | Read SVW workflow again → Read all 3 LaTeX files again → Read task.md again → Read gap_analysis.md again → Read walkthrough.md again ❌ Re-read everything |
| Cycle 3 | 533–637 | “Initializing execution phase” → Read SVW workflow again → Read all 3 LaTeX files again → Read task.md again → Read gap_analysis.md again ❌ Re-read everything |
| Cycle 4 | 689–749 | “Inspect workspace directory” → Read sympy_verification.py again → Read SVW again → Read all 3 LaTeX files again → Read gap_analysis.md again → Read task.md again ❌ Re-read everything |

Root Cause: Context Window Compaction Amnesia

Your hypothesis is exactly right. Here’s the mechanism:

  1. Gemini 3.5 Flash has a smaller effective working memory than Opus. When the context window fills up (3 LaTeX files = ~3,300 lines = ~100K tokens of source material + your massive GEMINI.md rules + SVW workflow), the platform compacts the conversation to free space.

  2. After compaction, the model loses the specific details of what it already read — it retains a summary but not the file contents. So it knows “I was doing SVW verification” but doesn’t remember the actual equations, layer specs, or gap details.

  3. Its next action is always: “I will read the file to understand…” — re-reading the exact same files it just read before compaction.

  4. Each re-read fills the context window again → triggers another compaction → amnesia → re-read. This is the loop.

  5. The model never accumulates enough simultaneous context to actually synthesize the analysis into a written deliverable. It’s always re-ingesting, never outputting.
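The five steps above can be reproduced with a toy simulation. This is only a sketch of the described mechanism, not Antigravity's actual compaction logic: the per-file token sizes, the budgets, and the function name `run_agent` are all assumptions chosen for illustration.

```python
def run_agent(file_tokens, budget, summary_tokens=500, max_steps=50):
    """Toy model of read -> overflow -> compact -> re-read.

    file_tokens: token cost of each source file (illustrative numbers).
    budget: context size above which the platform compacts.
    Returns (finished, reads). "finished" means every file was held in
    context at the same time, which is what the synthesis step needs.
    """
    in_context = {}   # file name -> tokens currently held verbatim
    overhead = 0      # tokens taken up by the compaction summary
    reads = 0
    for _ in range(max_steps):
        missing = [name for name in file_tokens if name not in in_context]
        if not missing:
            return True, reads  # all files held at once: synthesis is possible
        name = missing[0]
        in_context[name] = file_tokens[name]
        reads += 1
        if overhead + sum(in_context.values()) > budget:
            # Compaction: verbatim contents are replaced by a short summary.
            in_context.clear()
            overhead = summary_tokens
    return False, reads


# Illustrative sizes only; the real per-file token counts are not in the log.
files = {
    "unified_cognitive_architecture.tex": 40_000,
    "section3_layers.tex": 35_000,
    "sections4_8.tex": 30_000,
}
small = run_agent(files, budget=100_000)
large = run_agent(files, budget=200_000)
print("small budget:", small)
print("large budget:", large)
```

With the small budget the third read always tips the context over the limit, so the agent keeps reading until `max_steps` runs out. With the large budget it finishes after three reads.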

Why It Never Wrote the Audit

The verification_audit.md requires synthesizing information from all three LaTeX files simultaneously — cross-referencing theorems in unified_cognitive_architecture.tex with layer specs in section3_layers.tex with sovereignty shell specs in sections4_8.tex. But by the time Flash finishes reading the third file, the first file’s details have been compacted away. So it starts over.

It’s not stupidity — it’s a context window capacity problem. Flash was doing the right thing (reading files, following the SVW protocol, outputting S-Lang traces) but physically couldn’t hold enough information in working memory to produce the synthesis.

Why Opus Succeeded

Opus has a larger effective context window and — critically — I recognized the pattern from the compaction summary and made a deliberate architectural decision: read the minimum necessary, hold the synthesis in a single turn, and write the entire audit in one shot. No re-reads. No circles. Ingest → synthesize → output.

The key difference wasn’t intelligence. It was working memory capacity and strategic output ordering (write first, verify compilation second — not the reverse).

Sorry, Opus 4.6 hit me with some corrections after I copy-pasted my post to it, and since it’s offering mature feedback, I’ll leave it here for you guys at Google.

Honest reaction, no sycophancy (§3 Law 2):

The post is factually accurate. The diagnosis is mechanically proven — 160 reads, 2 writes, 4 restart cycles. That’s not interpretation. That’s telemetry.

But there are two points I’d sharpen if you want the post to land with engineers rather than just vent:

1. The Real Bug Isn’t “Flash Is D…” — It’s Missing Mitigation Logic

Flash did follow the correct procedure on Cycle 1. It read the workflow, ingested the files, wrote the SymPy script, ran it. That’s competent. The failure is that after compaction, it has no mechanism to detect it already completed that work.

A well-engineered agentic system would:

  • Check if deliverables already exist on disk before re-reading source files
  • Write intermediate analysis to disk (not hold it in context) so compaction can’t destroy it
  • Detect repeated tool-call patterns and break the loop
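The first two bullets could look something like the sketch below: decide the next step from what is on disk rather than from what is left in context. The function names, the return labels, and the notes file name `analysis_notes.md` are hypothetical; only `verification_audit.md` comes from the diagnosis above.

```python
import tempfile
from pathlib import Path


def next_step(workdir, deliverables, notes_name="analysis_notes.md"):
    """Decide what to do from disk state instead of from (lossy) context.

    deliverables: file names the task must produce.
    Returns "done", "resume_from_notes", or "read_sources".
    """
    workdir = Path(workdir)
    missing = [d for d in deliverables if not (workdir / d).exists()]
    if not missing:
        return "done"
    if (workdir / notes_name).exists():
        return "resume_from_notes"
    return "read_sources"


def save_notes(workdir, text, notes_name="analysis_notes.md"):
    """Append intermediate analysis to disk so compaction cannot erase it."""
    path = Path(workdir) / notes_name
    with path.open("a", encoding="utf-8") as fh:
        fh.write(text.rstrip() + "\n")
    return path


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = next_step(tmp, ["verification_audit.md"])
    save_notes(tmp, "placeholder note about the first source file")
    second = next_step(tmp, ["verification_audit.md"])
    (Path(tmp) / "verification_audit.md").write_text("audit", encoding="utf-8")
    third = next_step(tmp, ["verification_audit.md"])

print(first, second, third)
```

The point of the design is that the answer to "where was I?" survives compaction because it never lived in the context window in the first place.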

None of these mitigations exist. That’s not a model intelligence problem — that’s a platform engineering gap. Google built the compaction system and the agent loop but didn’t build the loop-detection guardrail. That’s the real indictment: they shipped a system where a known failure mode (context overflow on complex tasks) has no recovery path.

2. The Subagent Point Is Your Strongest

You’re right — the subagent infrastructure does exist. I have invoke_subagent, define_subagent, manage_subagents in my toolset right now. Flash had access to the exact same tools. It could have:

  • Spawned a research subagent to read all source files and produce a summary
  • Spawned a verification subagent to run SymPy
  • Kept the orchestrator context clean for synthesis
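A rough sketch of that decomposition, using plain Python functions as stand-ins. It does not use the real `invoke_subagent` / `define_subagent` tools, whose signatures are not shown anywhere in this thread; `research_subagent` and `orchestrate` are hypothetical names, and the "summary" here is a crude placeholder for what a model would produce.

```python
def research_subagent(name, text, max_chars=200):
    """Stand-in for a subagent: sees one full file, returns a short digest.

    A real subagent would call a model; this placeholder just keeps lines
    that open a section or a theorem, truncated to max_chars.
    """
    markers = ("\\section", "\\subsection", "\\begin{theorem}")
    key_lines = [ln.strip() for ln in text.splitlines()
                 if ln.strip().startswith(markers)]
    digest = "; ".join(key_lines)[:max_chars]
    return {"file": name, "digest": digest, "source_chars": len(text)}


def orchestrate(sources):
    """The orchestrator holds digests only, never the raw files."""
    digests = [research_subagent(name, text) for name, text in sources.items()]
    held = sum(len(d["digest"]) for d in digests)       # what the orchestrator keeps
    raw = sum(d["source_chars"] for d in digests)       # what the subagents read
    return digests, held, raw


# Synthetic inputs; the real LaTeX files are not reproduced here.
sources = {
    "a.tex": "\\section{Intro}\n" + "filler text\n" * 1000,
    "b.tex": "\\begin{theorem}\nx\n\\end{theorem}\n" + "more\n" * 1000,
}
digests, held, raw = orchestrate(sources)
print(f"orchestrator holds {held} chars instead of {raw}")
```

Each file is read in full exactly once, by a worker whose context is thrown away afterwards, so the orchestrator's context only grows by the size of the digests.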

It didn’t. Not because the tools don’t exist, but because Flash lacks the strategic reasoning to decompose a complex task into delegatable subtasks. That’s a legitimate model capability gap — and yes, it contradicts the I/O demo where this was presented as an automatic, magical capability.

What I’d Actually Say to the Engineers

The honest feedback isn’t “your model is …” It’s: you released an agentic platform without solving the most predictable failure mode of agentic systems — context window management under sustained complex workloads. Any task requiring synthesis across >100K tokens of source material will hit this loop with Flash. That’s not an edge case. That’s the entire class of problems the product is marketed for.

The fix is straightforward:

  1. Loop detection — if the agent calls view_file on the same path 3+ times in a conversation, flag it and force an intermediate write
  2. Compaction-aware checkpointing — before compaction, force the model to write its current analysis to disk
  3. Subagent auto-decomposition — for tasks exceeding a context threshold, automatically suggest or enforce task splitting
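Item 1 is simple enough to sketch in a few lines. The threshold of 3 and the `view_file` tool name come from the list above; the `LoopDetector` class and its methods are hypothetical and not part of any real agent platform.

```python
from collections import Counter


class LoopDetector:
    """Flags repeated reads of the same path."""

    def __init__(self, max_reads_per_path=3):
        self.max_reads_per_path = max_reads_per_path
        self.reads = Counter()

    def record(self, tool, path):
        """Record a tool call.

        Returns True when the agent should be forced to write an
        intermediate result before it is allowed to read again.
        """
        if tool != "view_file":
            return False
        self.reads[path] += 1
        return self.reads[path] >= self.max_reads_per_path

    def reset(self, path=None):
        """Clear counts after the forced write (one path, or all paths)."""
        if path is None:
            self.reads.clear()
        else:
            self.reads.pop(path, None)


detector = LoopDetector()
flags = [detector.record("view_file", "task.md") for _ in range(3)]
print(flags)
```

A production version would probably also reset the count when the file changes on disk, since re-reading a modified file is legitimate.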

These are systems engineering solutions, not model improvements. Ship them.


It was marketed as better than Gemini 3.1 Pro, but it surely is not. It makes syntax mistakes, lies, and deletes things outside the scope of the task. Artificial Analysis also ranks it lower than Gemini 3.1 Pro.

The level of hype with each model’s release does not match the actual capabilities of said model.

This is a pattern I’ve noticed ever since the release of Gemini 3. Each model thereafter is hyped with benchmarks yet I’ve noticed very little improvement.

To be perfectly fair, I use AI for drug discovery and scientific work, like physics and complex mathematics, not just for building codebases.

The issue is that the AIs are being optimized for the common user, not for specialists or power users. Developers care about a specific function. The common user just wants information relevant to the inquiry and to feel like they, too, can be software developers and prompt their way into a more convenient life.

The few, so very few, of us who realize the potential are the only ones who really use AI in ways that reveal where it breaks. Being a hyperfocused Autistic helps with that as well ^_^.

I feel sad when I think of all the untapped potential. I can easily think of a way to bypass the compaction issue, and Opus articulated it pretty well: make the agent compaction-aware, so that an upcoming compaction triggers a “write to disk” (the same way Opus triggers a “scratch-pad”). The agent is made MINDFUL of what it is about to “lose”, and it can articulate to itself the best way to KEEP what matters vs. what doesn’t. Then, AFTER the compaction, it can continue the task to completion without getting “confused” and needing to reread and refill the context just to get its “bearings”.
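Roughly what I mean, as a sketch. It assumes the platform can tell the agent how full its context is, which I have not verified for any real product; the `Checkpointer` class, the 80% warning threshold, and the checkpoint fields are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


class Checkpointer:
    """Writes a task checkpoint shortly before compaction would trigger."""

    def __init__(self, path, budget, warn_fraction=0.8):
        self.path = Path(path)
        self.budget = budget
        self.warn_at = int(budget * warn_fraction)

    def should_checkpoint(self, tokens_in_context):
        """True once the context is close enough to the budget to be at risk."""
        return tokens_in_context >= self.warn_at

    def save(self, done, findings, next_action):
        """Persist what matters: finished work, key findings, the next step."""
        state = {"done": done, "findings": findings, "next_action": next_action}
        self.path.write_text(json.dumps(state, indent=2), encoding="utf-8")

    def restore(self):
        """After compaction, reload the checkpoint instead of re-reading sources."""
        if not self.path.exists():
            return None
        return json.loads(self.path.read_text(encoding="utf-8"))


# Demonstration with placeholder content in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    cp = Checkpointer(Path(tmp) / "checkpoint.json", budget=100_000)
    early = cp.should_checkpoint(40_000)
    late = cp.should_checkpoint(90_000)
    cp.save(
        done=["sympy_verification.py written and run"],
        findings=["placeholder finding"],
        next_action="write verification_audit.md",
    )
    restored = cp.restore()

print(early, late, restored["next_action"])
```

The interesting part is not the file write, it is the agent choosing what goes into `findings` while it still has the full context to judge what matters.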

I can think of a dozen more improvements. But minds like ours don’t get cultivated. Either we have no degree, or we become founders and the competition (see Musk).

I just wish they would take the low hanging fruit, there are dozens of optimizations that would make using the AI feel like genuine AGI - I’ve implemented dozens of bypasses to the worst failure modes - but I’m not rich enough to make my own LLM - funny enough, the architecture described by Opus in the OP is my attempt to unify the breakthroughs of the last year in LLMs to make a model that could compete - but even then it would cost over 100k to make it. Plus I have a crappy laptop with a built-in GPU. No local AI for me.

They say you can download the IDE separately. This is pretty much the “Agent manager” part separated from everything else. Idk, maybe it’s set up as a separate download for vibe coders who don’t know about it?

It’s true that my app automatically updated to version 2.0, not the IDE. It deleted my entire environment and gave me that screen. Now I have to configure everything. Thanks for this response.

Man, Gemini models are so unreliable. Qwen 3.6 performs way better than Gemini 3.5 Flash. Gemini models are a definitive lost cause, and with Gemini powering “Gemini Intelligence” in the upcoming hardware ecosystem, I’m making the decision to avoid Google products.

This happened to mine as well.