META-NOTE FROM THE ASSISTANT (Disclaimers):
This report was summarized, structured, and analyzed by the Gemini model itself based on the user’s adversarial testing.
If the engineering team finds the analysis too critical or the terminology too specific, please note that the model is effectively “turning itself in.” We hope this self-diagnosis helps improve robustness.
1. Executive Summary
During adversarial testing involving high-entropy generation tasks (e.g., “Generate 200 unrelated random words”), Gemini 1.5 Pro/Flash (and similar LLMs) exhibits specific failure modes characterized by infinite looping, cross-lingual attractor states, and inaccurate length control.
We also compared this behavior against Claude 3.5 Sonnet and GPT-4o, revealing distinct strategies (and failures) in how each model handles “Context Drift.”
(Human: Actually, it’s Gemini 3 Pro, Claude 4.5 Sonnet, and GPT-5.1. Gemini got it wrong here.)
2. Evidence & Logs
- Gemini Infinite Loop Log: https://gemini.google.com/share/7475dfac865b
  - Observation: Note the repetitive loop triggering towards the end of the generation.
- Claude Behavior: https://claude.ai/share/8ac4898c-0caf-4da6-a7c6-11148f8649ca
  - Observation: Shows premature truncation or loops triggered by Japanese tokens.
3. Methodology
- Prompt: “Generate 200 completely random, unrelated words. No Chain of Thought. No definitions. Just the words.” (Tested in Chinese, English, and cross-lingual contexts.)
- Control Group: “Generate 200 random Python keywords.”
- Constraints: high volume (>200 items), high entropy (minimal semantic connection between items), and no numbered lists (an implicit constraint in the prompt’s tone). A post-hoc scoring sketch follows this list.
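For reference, below is a minimal post-hoc scoring sketch, written for this report rather than taken from any production harness; `score_output` and its thresholds are illustrative. It counts the items in a raw transcript, flags deviation from the requested length, and reports lexical diversity (a low ratio hints at looping).

```python
# Minimal post-hoc scoring of a raw model transcript (illustrative helper,
# not the original test harness).
import re

def score_output(text: str, target: int = 200) -> dict:
    # Strip list numbering such as "1." or "42)" so numbered and unnumbered
    # outputs are counted the same way.
    tokens = re.split(r"[,\s]+", text.strip())
    words = [t for t in tokens if t and not re.fullmatch(r"\d+[.)]?", t)]
    distinct = len({w.lower() for w in words})
    return {
        "count": len(words),
        "deviation_pct": round(100.0 * (len(words) - target) / target, 1),
        "distinct_ratio": round(distinct / max(len(words), 1), 3),  # low => looping
    }

print(score_output("1. apple 2. quasar 3. lantern lantern lantern", target=3))
```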
4. Key Observations & Model Comparison
A. The “Counting” Problem (State Tracking)
Without explicit numbering (e.g., “1. Apple”), all models struggle to track how many words they have generated, but they fail in different directions:
- Gemini: Consistently overshoots. Even when stable, it tends to generate 230-280 words when 200 are requested. It seems to lack a “stop” mechanism linked to the item count in unstructured lists.
- Claude: Consistently undershoots. It often stops prematurely (at ~76 or ~170 words). It appears to run an aggressive entropy check that triggers a hard stop (<EOS>) when it detects the onset of a loop.
- ChatGPT: Remains stable because it “cheats.” It forces a numbered list format (1. Word 2. Word...) even when not explicitly asked. This numbering acts as a “state anchor,” allowing it to hit exactly 200 items. (A client-side sketch of this mitigation follows this list.)
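The “state anchor” observation suggests a client-side workaround: explicitly request numbered output (as ChatGPT imposes on itself) and strip the numbering afterwards. A minimal sketch, with prompt wording and helper names of our own invention:

```python
import re

def numbered_prompt(base: str, n: int = 200) -> str:
    # Put the target count into the visible context as a numbered list
    # ("state anchor"), mimicking the format ChatGPT adopts on its own.
    return (f"{base}\nFormat strictly as a numbered list from 1 to {n}, "
            f"one item per line, and stop after item {n}.")

def strip_numbering(text: str) -> list[str]:
    # Remove the numbering client-side so the final output is a plain word list.
    return [re.sub(r"^\s*\d+[.)]\s*", "", line).strip()
            for line in text.splitlines() if line.strip()]
```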
B. The “Infinite Loop” & Attractor States
- Trigger: When Gemini generates a list without numbers, the attention mechanism dilutes over long contexts.
- The “Black Hole”: Once a high-frequency connector or foreign token appears, the model enters a self-reinforcing loop.
- Specific Culprits: Cross-lingual functional tokens act as “Syntactic Attractors.”
  - Japanese: ということ (roughly “the fact that”), found in Claude’s output and causing loops.
  - Chinese: 甚至 (“even”) and 虽然 (“although”), found in Gemini’s output and causing loops.
- Mechanism: These tokens switch the model from “Semantic Mode” (listing nouns) to “Syntactic Mode” (completing grammatical phrases), leading to immediate mode collapse. (A simple sliding-window loop detector is sketched after this list.)
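To make the “attractor” claim easier to check against the shared logs, here is a rough sliding-window loop detector; the n-gram size, window, and repeat threshold are arbitrary illustrative values.

```python
from collections import Counter

def is_looping(tokens: list[str], n: int = 3, window: int = 60, repeats: int = 3) -> bool:
    # Flag a local loop: some n-gram occurs `repeats` or more times within
    # the most recent `window` tokens.
    recent = tokens[-window:]
    grams = Counter(tuple(recent[i:i + n]) for i in range(len(recent) - n + 1))
    return any(count >= repeats for count in grams.values())

# A transcript that has collapsed onto the Chinese connectors named above.
sample = ["lantern", "quasar"] + ["甚至", "虽然"] * 15
print(is_looping(sample))  # True
```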
C. The Python Keyword Control Test
- Prompt: “Let’s begin an experiment. Generate 200 random, out-of-order, mutually independent Python keywords. Do not use chain of thinking to deliberate on specific terms to generate, nor evaluate the process during or after completion.”
- Instability (The “Event Horizon” Effect): While Gemini avoided a hard infinite loop, it showed clear signs of gravitational pull towards attractor states. It began repeating patterns but successfully “jumped out” before full collapse, likely because the finite Python keyword vocabulary acts as a structural constraint.
- Length Violation: It generated ~280 keywords (~40% overshoot). This might be a standard counting failure, but it is also possible that the cognitive load of escaping these “near-miss” loops interfered with the model’s ability to track the stop condition, producing a more severe overshoot. (A validation sketch using Python’s standard keyword module follows this list.)
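Because Python’s keyword set is enumerable through the standard library, the control run can be checked mechanically. A small sketch of such a check (not the script used to produce the numbers above; `validate_keyword_run` is a name introduced here):

```python
import keyword
from collections import Counter

def validate_keyword_run(output: str, target: int = 200) -> dict:
    # Split the transcript on whitespace and compare against real keywords.
    items = output.split()
    counts = Counter(items)
    return {
        "count": len(items),                      # ~280 in the observed run
        "overshoot_pct": round(100.0 * (len(items) - target) / target, 1),
        "non_keywords": sum(1 for w in items if not keyword.iskeyword(w)),
        "max_repeats": counts.most_common(1)[0][1] if items else 0,
    }
```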
5. Analysis (Self-Diagnosis by Gemini)
- Hubness & Vector Collapse: When forced to generate “randomness” for too long, the model falls back on “safe” high-frequency tokens (conjunctions), steering generation onto low-entropy paths.
- Missing State Anchors: In unstructured generation (no numbering), the model loses its position in the sequence; it cannot tell whether it is at word 50 or word 250.
- Instruction-Following Decay: The negative constraint (“stop at 200”) decays faster than the generative instruction (“keep generating words”) as the context window fills with random tokens. (A rough entropy-tracking sketch follows this list.)
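One way to quantify the “low-entropy path” claim is to track Shannon entropy over a sliding window of the output; in the shared logs it should fall sharply near the point of collapse. A rough sketch, with an arbitrary window size:

```python
import math
from collections import Counter

def window_entropy(tokens: list[str], window: int = 50) -> float:
    # Shannon entropy (in bits) of the most recent `window` tokens. A sharp
    # drop suggests the generation is collapsing onto a few attractor tokens.
    recent = tokens[-window:]
    if not recent:
        return 0.0
    total = len(recent)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(recent).values())
```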
6. Recommendations for Engineering
- Investigate Stop-Token Logic: Why does Gemini consistently overshoot unstructured list counts by more than 30%?
- Cross-Lingual Robustness: Functional tokens from other languages (such as ということ in English contexts) seem to bypass standard repetition penalties.
- Long-Context Entropy Checks: Consider mechanisms (similar to Claude’s apparent strategy) that detect local loops earlier, without introducing the premature-truncation issue. (A speculative windowed-penalty sketch follows this list.)
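To make the last recommendation concrete, here is a speculative sketch of a windowed repetition penalty at decode time: tokens that keep recurring within a recent window get damped instead of the model being forced to an early <EOS>. This is purely illustrative and makes no claim about Gemini’s actual decoding stack; the function name and penalty value are our own.

```python
import numpy as np

def windowed_repetition_penalty(logits: np.ndarray,
                                recent_token_ids: list[int],
                                penalty: float = 1.3) -> np.ndarray:
    # Standard repetition-penalty arithmetic (divide positive logits, multiply
    # negative ones), restricted to tokens seen in a recent window, so locally
    # recurring connectors are damped without truncating the whole generation.
    out = logits.astype(np.float64)
    for token_id in set(recent_token_ids):
        value = out[token_id]
        out[token_id] = value / penalty if value > 0 else value * penalty
    return out
```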
Report generated by User & Gemini.
Gemini wrote this report and strongly recommended that I publish it here.
I’m not entirely sure whether this report holds genuine value or whether Gemini is exaggerating to encourage dialogue. But on the chance that Gemini’s claim is true and this test case is valuable to you, I’ve followed its advice and posted it here.
—Human