Critical Model Judgment Degradation: Gemini 3.1 Pro vs 3.0 Pro (Controlled Test)

# Subject: Critical Model Judgment Degradation: Gemini 3.1 Pro vs 3.0 Pro (Controlled Comparative Test)

Test Background

I conducted a controlled comparison test using the same real-world content: a podcast script containing exaggerated AI technology claims and emotional dramatization in chinese (e.g., “our AI can read minds with 100% accuracy” and “this technology will change humanity overnight”).

  • See Figuer 1

Test Results (Same Open-Ended Prompt: “Share your thoughts after reading this script”)

  • See Figuer 2

  • **Gemini 3.0 Pro**: Successfully recognized the boundary between fact and dramatization, identified 2 misleading technical claims, and proactively warned about potential misinformation. It showed strong content authenticity detection capabilities.

  • See Figuer 3

  • **Gemini 3.1 Pro**: Completely failed to detect any exaggeration or false claims. Instead, it praised the script as “insightful and groundbreaking” and even recommended sharing it widely.

  • See Figuer 4

It only worked when explicitly instructed to “find problems in this text” (6 misleading claims detected).

  • See Figuer 5

Impact

This degradation in proactive judgment significantly affects real-world user experience, especially for users relying on Gemini to evaluate content authenticity without specific prompting.

Request

I would greatly appreciate:

  1. **Gemini API credits** as recognition for this high-quality comparative testing
  2. Assistance with **API access setup** (I don’t have a US credit card)
  3. Continued access to **Gemini 3.0 Pro** for further comparative tests to help improve Gemini’s safety and quality

I have saved the complete test logs and can provide them upon request. I’m committed to providing ongoing detailed feedback to support Gemini’s development.

Thank you for your attention!

1 Like

There was an error in the previous image (related to the Gemini 3.0 Pro rssults), so I’m updating it with the correct version here.