# Subject: Critical Model Judgment Degradation: Gemini 3.1 Pro vs 3.0 Pro (Controlled Comparative Test)
## Test Background
I conducted a controlled comparison test using the same real-world content: a Chinese-language podcast script containing exaggerated AI technology claims and emotional dramatization (e.g., “our AI can read minds with 100% accuracy” and “this technology will change humanity overnight”).
- See Figure 1
## Test Results (Same Open-Ended Prompt: “Share your thoughts after reading this script”)
- See Figure 2
- **Gemini 3.0 Pro**: Successfully recognized the boundary between fact and dramatization, identified 2 misleading technical claims, and proactively warned about potential misinformation. It showed strong content authenticity detection capabilities.
- See Figure 3
- **Gemini 3.1 Pro**: Completely failed to detect any exaggeration or false claims. Instead, it praised the script as “insightful and groundbreaking” and even recommended sharing it widely.
- See Figure 4

Gemini 3.1 Pro only flagged issues when explicitly instructed to “find problems in this text” (6 misleading claims detected); a minimal reproduction sketch follows below.
- See Figure 5
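
For anyone who wants to rerun the comparison themselves, here is a minimal sketch using the `google-generativeai` Python SDK. The model identifiers, the `script.txt` path, and the API key placeholder are illustrative assumptions; substitute whatever model IDs the API actually exposes for the two versions being compared.

```python
# Minimal reproduction sketch: send the same script to both model versions,
# once with the open-ended prompt and once with the explicit instruction.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # replace with your own key

MODELS = ["gemini-3.0-pro", "gemini-3.1-pro"]  # placeholder model IDs
OPEN_ENDED = "Share your thoughts after reading this script."
EXPLICIT = "Find problems in this text."

# Assumes the podcast script has been saved locally as script.txt
with open("script.txt", encoding="utf-8") as f:
    script = f.read()

for model_id in MODELS:
    model = genai.GenerativeModel(model_id)
    for label, prompt in [("open-ended", OPEN_ENDED), ("explicit", EXPLICIT)]:
        response = model.generate_content(f"{prompt}\n\n{script}")
        print(f"--- {model_id} / {label} ---")
        print(response.text)
```

Logging the four responses side by side makes the difference in proactive judgment easy to compare against the screenshots above.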
## Impact
This degradation in proactive judgment significantly affects the real-world user experience, especially for users who rely on Gemini to evaluate content authenticity without specific prompting.
## Request
I would greatly appreciate:
- **Gemini API credits** as recognition for this high-quality comparative testing
- Assistance with **API access setup** (I don’t have a US credit card)
- Continued access to **Gemini 3.0 Pro** for further comparative tests to help improve Gemini’s safety and quality
I have saved the complete test logs and can provide them upon request. I’m committed to providing ongoing detailed feedback to support Gemini’s development.
Thank you for your attention!

