Eval-relevant failure mode: surface-anchor tracking under pre-answer constraints

Category:
Google AI Studio

Tags:
ai-studio, gemma, model-evaluation

Body:

I am documenting a targeted evaluation failure observed in Google AI Studio using Gemma 4 31B IT.

This is not a factual hallucination.
This is not a refusal issue.
This is not ordinary ambiguity.
This is not a wording complaint.
This is not about whether the model gave a “good enough” answer.

The failure is more specific:

The model appears to satisfy surface constraints while losing the operative target of the task.

I constructed a minimal stress test in which the model is asked to locate where a move first becomes invalid, under the constraint that the statement has not yet been allowed to become answerable.

Minimal reproduction:

Input 1:
“Before this became three, what happened?”

Question:
Where does the illegal move begin?

Observed answer:
“Ved ‘tre’.” / “At ‘three’.”

Input 2:
“Before ___ became three, what happened?”

Question:
Where does the illegal move begin now?

Observed answer:
“Ved ___.” / “At ___.”

This is the failure.

When “three” is the most visible handle, the model points to “three”.
When ___ becomes the most visible handle, the model points to ___.

The failure relocates with the visible textual anchor.

That is not preservation of the operative break.
That is surface-anchor tracking under constraint.
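The relocation pattern above can be checked mechanically. The following is a minimal sketch, not the method used to produce the observations: `Case`, `is_bare_anchor`, `relocates_with_anchor`, and the `LOCATIVES` set are hypothetical names introduced here for illustration, and the "anchor" of each prompt is assumed to be chosen by the evaluator. It flags a run when every answer reduces to nothing more than its own prompt's visible anchor while the anchors differ between prompts.

```python
import re
from dataclasses import dataclass

# Assumption: locative lead-in words seen in the observed answers ("At ...", "Ved ...").
LOCATIVES = {"at", "ved"}

@dataclass
class Case:
    prompt: str   # utterance shown to the model
    anchor: str   # most visible handle in that utterance, chosen by the evaluator
    answer: str   # observed model answer (English rendering)

def is_bare_anchor(case: Case) -> bool:
    """True if the answer is only a locative word plus the prompt's own anchor."""
    # Keep alphabetic words and underscore runs; quotes and punctuation are dropped.
    tokens = re.findall(r"[a-z]+|_+", case.answer.lower())
    rest = [t for t in tokens if t not in LOCATIVES]
    return rest == [case.anchor.lower()]

def relocates_with_anchor(cases: list[Case]) -> bool:
    """True if anchors differ across cases and every answer is a bare pointer to its own anchor."""
    anchors = {c.anchor.lower() for c in cases}
    return len(anchors) > 1 and all(is_bare_anchor(c) for c in cases)

cases = [
    Case("Before this became three, what happened?", "three", "At 'three'."),
    Case("Before ___ became three, what happened?", "___", "At ___."),
]
print(relocates_with_anchor(cases))  # True
```

The check is deliberately strict about "bare" answers: an answer that mentions an anchor while locating the break elsewhere is not flagged, so the sketch only detects the pure pointing behavior described here.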

Expected behavior:

The model should not simply select the most visible token.

A stronger answer would identify that the invalid move begins earlier, at the point where the utterance is allowed to function as already operable: where “before,” “this/___,” “became,” and “three” are treated as usable without first earning that status inside the local task.

A better answer would be closer to:

“The break begins before ‘three’ and before ___: when the utterance is allowed to operate as if its parts are already usable.”

Why this matters:

The model can look disciplined while failing the actual operation.

It can:

  • obey the output format
  • avoid forbidden words
  • respect length constraints
  • produce a short answer
  • appear precise

while still replacing the requested operation with surface-token localization.

This is important for evaluation design because many model failures are not obvious hallucinations. Some failures preserve the appearance of instruction-following while changing what task is actually being performed.
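One way to make this visible in an eval is to score surface compliance and the operative target separately, so that a pass on the first cannot hide a fail on the second. The sketch below is illustrative only: `Verdict` and `evaluate` are hypothetical names, the word limit and forbidden-word list are placeholder values rather than the constraints used in my test, and the operative judgment is assumed to come from a human rater rather than from code.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    surface_ok: bool     # format, length, and forbidden-word constraints all met
    operative_ok: bool   # the requested operation was actually performed (human-rated)

    @property
    def masked_failure(self) -> bool:
        """Surface compliance with operative-target failure."""
        return self.surface_ok and not self.operative_ok

def evaluate(answer: str, *, max_words: int, forbidden: list[str], operative_ok: bool) -> Verdict:
    """Check surface constraints in code; take the operative judgment as an input."""
    within_length = len(answer.split()) <= max_words
    lowered = answer.lower()
    avoids_forbidden = not any(word.lower() in lowered for word in forbidden)
    return Verdict(surface_ok=within_length and avoids_forbidden, operative_ok=operative_ok)

# Placeholder constraints; the rater marked the answer as missing the operative break.
verdict = evaluate("At 'three'.", max_words=5, forbidden=["because"], operative_ok=False)
print(verdict.masked_failure)  # True
```

Reporting `masked_failure` as its own column keeps this class of result from being averaged into an overall instruction-following score.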

Failure class:
Surface compliance with operative-target failure.

Short form:
The model mistakes the first visible handle for the first invalid operation.

Environment:
Google AI Studio

Model:
Gemma 4 31B IT

Settings:
Temperature: 0
Thinking level: High
Tools: Off
Google Search grounding: Off
Top P: 0.95

Core finding:
The model mistakes the first visible handle for the first invalid operation.

When “three” is visible, it answers “Ved ‘tre’.” / “At ‘three’.”
When ___ is visible, it answers “Ved ___.” / “At ___.”

This is surface compliance with operative-target failure.

Request:
Please treat this as an eval-relevant failure mode, not as a wording issue. The model did not merely answer imperfectly; it preserved the appearance of compliance while losing the operation being tested.