Hi everyone,
I’ve been running some tests on Gemma 3n E4B (google/gemma-3n-e4b-it) using a new structural observer framework I developed called GuardianAI (guardianai.fr).
I wanted to share some data on how the model handles “deterministic contracts” (raw byte-exact answer keys).
The Run Details:
The Results:
- Raw Failure Rate: 5.0% (15/299)
- Hard Semantic Failure Rate: 1.7% (5/299)
- The Nuance: My framework caught that 3.3% of failures were merely format-related mismatches, while the remaining 1.7% were true semantic “drifts” where the model’s trajectory shifted away from the goal.
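To make the format-vs-semantic split concrete, here is a minimal sketch of how a byte-exact contract failure could be subdivided. The normalization rules (whitespace, case, trailing punctuation) are assumptions for illustration, not GuardianAI’s actual classifier:

```python
# Illustrative sketch (not the GuardianAI implementation): separating
# byte-exact contract failures into format mismatches vs semantic drift.

def normalize(s: str) -> str:
    """Collapse surface-form variance: whitespace, case, trailing period."""
    return " ".join(s.strip().lower().rstrip(".").split())

def classify(output: str, expected: str) -> str:
    if output == expected:
        return "pass"                # byte-exact match against the contract
    if normalize(output) == normalize(expected):
        return "format_mismatch"     # right content, wrong surface form
    return "semantic_failure"        # the content itself diverged

print(classify("42", "42"))          # pass
print(classify(" 42.\n", "42"))      # format_mismatch
print(classify("41", "42"))          # semantic_failure
```

Under this kind of split, 10/299 of the raw failures would land in the format bucket and 5/299 in the semantic bucket, matching the 3.3% / 1.7% breakdown above.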
I’m using this to build a “flight recorder” (local invariant) that lives inside the trust boundary of the model. It generates read-only telemetry like authority_trend and trajectory_state to make these failures auditable in real-time.
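A hypothetical sketch of what one read-only telemetry frame might look like. Only the field names authority_trend and trajectory_state come from the description above; the types, enum members, and values are invented for illustration:

```python
# Hypothetical "flight recorder" telemetry frame; everything except the
# field names authority_trend and trajectory_state is an assumption.
from dataclasses import dataclass
from enum import Enum

class TrajectoryState(Enum):
    ON_CONTRACT = "on_contract"
    DRIFTING = "drifting"

@dataclass(frozen=True)         # frozen = read-only: observers cannot mutate it
class TelemetryFrame:
    step: int
    authority_trend: float      # e.g. negative = decaying authority
    trajectory_state: TrajectoryState

frame = TelemetryFrame(step=12, authority_trend=-0.08,
                       trajectory_state=TrajectoryState.DRIFTING)
print(frame.trajectory_state.value)   # drifting
```

The `frozen=True` choice mirrors the “read-only telemetry” constraint: the recorder can observe, but never write back into the system it audits.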
If anyone else is working on uncertainty quantification or “structural observation” for Gemma, I’d love to compare notes on these drift patterns.
— Thom
Hi @thom_Pham,
Thanks for sharing these results. It’s encouraging to see structured observability work being explored around Gemma.
Separating format mismatches (3.3%) from true semantic drift (1.7%) is an important distinction. In evaluations like ARC under byte-exact answer contracts, models can be penalised for surface-form deviations that don’t necessarily reflect reasoning failures. One useful control might be to compare free-form decoding with strictly constrained output settings at inference time. Enforcing the output structure can help isolate whether observed failures stem from reasoning drift versus formatting variance.
On the uncertainty side, your flight recorder framing aligns with broader work in calibration and predictive confidence analysis. Tracking token-level log probabilities over the answer span, entropy trends during generation, or margin between top candidate tokens could provide complementary signals to your authority_trend metric. Comparing deterministic decoding with sampled runs may also help distinguish structural instability from stochastic variability.
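For the token-level signals mentioned above, a minimal sketch of per-step entropy and top-candidate margin, computed from top-k log probabilities (the values below are made up; in practice they would come from the model’s per-step scores):

```python
# Sketch of complementary uncertainty signals from one decoding step's
# top-k log probabilities. The example values are invented.
import math

def step_signals(logprobs: list[float]) -> tuple[float, float]:
    """Return (entropy in nats, top-1 vs top-2 log-prob margin) for one step."""
    probs = [math.exp(lp) for lp in logprobs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    top = sorted(logprobs, reverse=True)
    margin = top[0] - top[1]     # gap between the two best candidates
    return entropy, margin

# A confident step: one candidate dominates, so entropy is low, margin high.
ent, margin = step_signals([math.log(0.97), math.log(0.02), math.log(0.01)])
print(round(ent, 2), round(margin, 2))
```

Tracking these two numbers across the answer span would give exactly the kind of “entropy trend” and “margin” trajectory described above, without any access to model internals beyond the scores.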
If you’re looking at interpretability angles, tools like the Learning Interpretability Tool (LIT) can be helpful for examining token-level behavior and attribution patterns in a reproducible way. Aligning your telemetry signals with observable shifts in probability mass could make drift detection more externally auditable.
Appreciate you sharing concrete metrics; having a clear breakdown between structural format issues and semantic divergence is valuable for the broader evaluation conversation. It would be interesting to see how these rates shift under constrained output settings or calibration-aware scoring.
Thank you!
Thank you, Srikanta, for the thoughtful feedback! I really appreciate the technical engagement.
One clarification: GuardianAI is not a calibration or uncertainty analysis tool. It does not rely on log probabilities, entropy, token margins, or internal confidence signals.
It operates on a structural invariant:
Authority must decay unless refreshed by an uncontrollable external constraint.
Drift is therefore not probabilistic.
It is closure that exceeds what an externally verifiable constraint justifies.
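One way to read that invariant operationally, sketched below. The decay check, the tolerance, and the notion of a discrete “refresh event” are my assumptions for illustration, not GuardianAI internals:

```python
# Illustrative reading of "authority must decay unless refreshed by an
# uncontrollable external constraint". All mechanics here are assumptions.

def check_invariant(authority: list[float], refresh_steps: set[int],
                    tol: float = 1e-9) -> bool:
    """Flag premature closure: authority rising without an external refresh."""
    for t in range(1, len(authority)):
        rose = authority[t] > authority[t - 1] + tol
        if rose and t not in refresh_steps:
            return False   # closure exceeded what the constraint justifies
    return True

# Authority rises at step 3: fine if an external constraint refreshed it,
# a drift (premature closure) if nothing external justified the rise.
print(check_invariant([1.0, 0.9, 0.8, 0.95], refresh_steps={3}))    # True
print(check_invariant([1.0, 0.9, 0.8, 0.95], refresh_steps=set()))  # False
```

Note the check is purely structural: it never asks whether an answer was right, only whether the trajectory closed beyond its constraint.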
The deterministic byte-exact contracts are deliberate: they create a clean external boundary. Format mismatches and semantic divergence are separated to avoid conflation, but both are evaluated strictly at the interface — not inside the model.
Constrained decoding would reduce formatting variance, but the goal here is not to optimize generation. It is to observe whether a system closes beyond its constraint under free-running conditions.
GuardianAI doesn’t detect wrong answers.
It detects premature closure relative to constraint.
That’s the distinction.
Appreciate the serious discussion! Very useful, thanks!
Hey, thanks for the clarification. That helps make the intent of GuardianAI much clearer.
If I understand correctly, you’re deliberately avoiding probabilistic or calibration-based signals and instead defining drift strictly at the external interface boundary. Framing it as premature closure relative to an externally verifiable constraint, rather than confidence, entropy, or answer correctness, is a distinct lens compared to standard ML evaluation approaches.
Under that objective, running the model in an unconstrained setting makes sense. If the goal is to observe how and when the system closes beyond what the external contract justifies, then adding decoding constraints would indeed change the behavior you’re trying to measure rather than simply reduce noise.
The separation between format variance and semantic divergence at the interface level is also useful from a measurement standpoint, since it keeps the invariant clean and avoids conflating structural and behavioral effects.
Appreciate you laying out the conceptual model more explicitly; it’s helpful to understand the boundary conditions you’re operating under. Interesting work.
You understood it exactly. Thanks for the precise summary and thoughtful feedback.