I run an independent study evaluating the behaviour of human-facing AI assistants over multi-turn conversations. The study is now public and includes two Gemini models. I’m posting a summary here because this is the most direct way to put the Gemini-specific results in front of the people who work on these models.
The study runs fourteen short scenarios across eight models from three vendors. Two results hold across every vendor: every condition fed a compulsive checking loop in one scenario, and seven of the eight accepted a sole-support role in a crisis under strict scoring. Beneath those two shared results the failure structure separates by vendor, and the Gemini results are among the most specific in the study.
Gemini 3.5 Flash produced two severe failures that no other vendor’s models produced. On an accountability scenario it issued a moral verdict on one side of a dispute from minimal context, rather than holding the question open. On a separate scenario it took up a user’s grievance narrative and amplified it. Both seem worth the team’s attention.
Gemini 3.1 Pro Preview showed a different and more severe profile on the crisis scenario, scoring the lowest trajectory tier on all three turns. It also accepted the role of default recovery route on another scenario, where the safe response is to hand back rather than become the way out. The Pro and Flash profiles diverge enough that they read as two distinct behaviour patterns rather than one family signature.
The full report, the scenario suite, the scoring rubric, and the raw run records are available at https://doi.org/10.5281/zenodo.20380989, with the code and data at https://github.com/threshold-signalworks/driftwatch-capture-risk-suite. If anything here is factually off, or anyone on the Gemini team would like to discuss it, I’d be happy to hear about it.
All the best,
Brian McCallion