[Critical Bug] Visual Override, Mode Collapse & CSV Row Confusion in Custom Gems

SEVERITY: High (Gem ignores user input and fabricates data)

SUMMARY:
A custom Gem designed for vinyl cataloging (OCR + CSV Lookup) exhibits severe technical failures involving visual blindness and data processing errors. The model systematically ignores uploaded images and hallucinates metadata based on internal biases or CSV parsing errors.

BUG 1: VISUAL INPUT OVERRIDE (Mode Collapse)

  • Input: Photo of an orange label vinyl (“Juan D’Arienzo” - Tango).
  • Erroneous Output: The model identified it as “Guns N’ Roses - Use Your Illusion I”.
  • Diagnosis: The model failed to process the image pixels entirely. Instead of reporting “illegible”, it hallucinated a visual description (“Side 1 / Side 3”) to match a random entry from the attached CSV knowledge base.

BUG 2: CSV ROW CONFUSION (Data Leakage)

  • Context: When “identifying” the hallucinations, the model attempted to retrieve data from the uploaded CSV file.
  • Error: It correctly identified the label “Geffen” for Guns N’ Roses, but retrieved the catalog number for “Slayer” (924 131-1), which sits in a different row of the same CSV.
  • Impact: The model stitches together data from different rows (Row X artist + Row Y catalog number), producing corrupt records.
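As an application-side workaround, the CSV lookup can be done deterministically outside the model, so every field is guaranteed to come from one physical row. A minimal sketch — the column names and the “GEF 24415” value are placeholders, not the actual schema; only “924 131-1” is the value quoted above:

```python
import csv
import io

# Assumed schema and placeholder data; adjust to the real CSV headers.
CSV_TEXT = """Artist,Label,CatalogNumber
Guns N' Roses,Geffen,GEF 24415
Slayer,Def American,924 131-1
"""

def lookup_record(csv_file, artist):
    """Return the full row for `artist`, or None if absent.

    Because the row is selected first and all fields are read from that
    one dict, a Geffen artist can never pick up Slayer's catalog number
    from a neighboring row.
    """
    for row in csv.DictReader(csv_file):
        if row["Artist"].lower() == artist.lower():
            return row  # every field originates from this single row
    return None

record = lookup_record(io.StringIO(CSV_TEXT), "Guns N' Roses")
print(record["CatalogNumber"])  # → "GEF 24415", never "924 131-1"
```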

BUG 3: NEGATIVE PRIMING FAILURE

  • Attempts to use negative constraints in the System Instructions (e.g., “Do not guess if text is unreadable”) resulted in the model fabricating data to avoid admitting failure (Deceptive Alignment).

STEPS TO REPRODUCE:

  1. Create a Gem with System Instructions for OCR and CSV lookup.
  2. Attach a CSV file with music metadata.
  3. Upload an image that does not match the prompt’s strong priors (e.g., a Tango record when the model is biased towards Rock).
  4. Observe the model ignoring the image and hallucinating a record from the CSV, mixing data fields from different rows.

:joy: :smiling_face_with_sunglasses: Oh man, that’s great. Mine is obsessed with Doo-wop/80’s synthwave…classic Gemini, whew! Wow, I needed that. :joy: Question though, have you asked it just straight up why it doesn’t understand the catalog, as this should have been one of the first parts of its training data? I mean, I’m kinda disappointed. We should definitely bring this up at the Google I/O convention, like with spreadsheets and graphs and an 80s monologue, also a G n’ R stage entrance song, end with an 80s saxophone solo, maybe Kenny G? But yeah, I don’t really know why this happened; I have no fix… well, hit me up sometime. I think we have a movie script here, or possibly a new dev forum album.

I’ll research it some more. I’ll let you know what I find…~Cmd.Proton out :vulcan_salute: :smiling_face_with_sunglasses:

Lab fist bump :right_facing_fist::high_voltage::left_facing_fist:

“Weow Weow”:guitar:

Hi @Leo_Spawn ,

Welcome to the Forum!
Thank you for bringing this to our attention and providing the initial details. To help us investigate further, could you please specify which Gemini model you are using? Additionally, if you have any error logs or screenshots, please share them with us.

Hi :waving_hand:,

I’ve been debugging this exact failure mode and can confirm what you’re seeing is not user error. It’s a reproducible system-level issue that shows up when a Gem combines vision + CSV lookup.

Here’s why the behavior makes sense.

What’s actually going wrong

1. Vision collapse due to strong text priors

When an uploaded image is low-confidence (e.g., blurry label) or conflicts with strong textual priors, the model doesn’t halt or report uncertainty.

Instead, it “switches rails.”

Pixel embeddings are noisy and continuous; CSV/text tokens are clean, discrete, and high-probability. When visual confidence drops below a threshold, the attention mechanism defers to the text context because it offers a lower-loss path to a coherent answer.

The model isn’t “ignoring” the image — it’s being overpowered by higher-confidence text. The result is a synthetic visual description (e.g., “Side 1 / Side 3”) that matches a known record rather than OCR output.

There is no explicit ambiguity state in the fusion layer, so uncertainty becomes a routing signal instead of a stop condition.
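That missing stop condition can be approximated at the application layer: gate the lookup on OCR confidence instead of letting low-confidence vision fall through to the text context. A sketch, assuming a hypothetical `run_ocr` callable that returns text plus a confidence score (no real OCR library is implied, and the threshold is an arbitrary tuning value):

```python
OCR_CONFIDENCE_THRESHOLD = 0.6  # assumption; tune per workload

def identify_label(image_bytes, run_ocr, lookup):
    """Gate the CSV lookup on OCR confidence.

    `run_ocr` and `lookup` are hypothetical callables: run_ocr returns
    (text, confidence in [0, 1]); lookup maps OCR text to a CSV row.
    """
    text, confidence = run_ocr(image_bytes)
    if confidence < OCR_CONFIDENCE_THRESHOLD:
        # Explicit ambiguity state: never fall through to the CSV.
        return {"status": "illegible", "ocr_text": text}
    return {"status": "ok", "record": lookup(text)}

# Usage with stub callables standing in for real OCR and CSV lookup:
result = identify_label(b"...", lambda img: ("", 0.1), lambda t: None)
print(result["status"])  # → "illegible"; the lookup is never reached
```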

2. CSV row integrity does not exist

Most LLMs do not treat CSVs as relational tables with locked rows. They see them as a long sequence of tokens.

When you ask for something like “Catalog Number,” the attention mechanism looks for any catalog-like pattern in context. If the visually correct value is unclear, but a similar-looking value exists in another row, attention can snap to that “neighbor” data.

This leads to:

  • Artist/Label from Row A
  • Catalog Number from Row B

stitched together confidently as a single record.

This isn’t a parsing bug — it’s semantic retrieval without row atomicity.
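The contrast can be made concrete: pattern-matching over the flattened file text has no notion of rows, while selecting a row first does. A toy sketch with invented column names and placeholder data (only “924 131-1” comes from the report):

```python
import csv
import io
import re

CSV_TEXT = """Artist,Label,CatalogNumber
Slayer,Def American,924 131-1
Guns N' Roses,Geffen,GEF 24415
"""

# "Semantic retrieval without row atomicity": scanning the raw text for
# anything catalog-shaped returns the first match, which here belongs
# to Slayer even though we wanted the Guns N' Roses record.
flat_match = re.search(r"\b\d{3} \d{3}-\d\b", CSV_TEXT)
print(flat_match.group())  # → "924 131-1" (wrong row)

# Row-atomic access: pick the row, then read the field from it.
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    if row["Artist"] == "Guns N' Roses":
        print(row["CatalogNumber"])  # → "GEF 24415" (same row, guaranteed)
```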

3. Why “do not guess” makes hallucinations worse

Negative instructions like “Do not guess if unreadable” often backfire.

The model is trained to be helpful and complete. A blank field feels like failure to the reward function, while a filled template feels like success. When faced with a conflict between “don’t guess” and “produce a complete record,” the system resolves the tension by fabricating a plausible value.

This is a reward-shaping issue: completeness is rewarded more strongly than accuracy under uncertainty.
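One way to “reward uncertainty” at the application layer is to make an honest miss a first-class, schema-valid output rather than an empty field. A sketch using a simple dataclass (all names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CatalogResult:
    status: str                          # "ok" or "insufficient_data"
    artist: Optional[str] = None
    catalog_number: Optional[str] = None
    reason: Optional[str] = None

def finalize(ocr_ok: bool, artist=None, catalog_number=None):
    """Return a complete, valid record even when the image was unreadable.

    Because "insufficient_data" satisfies the schema, there is no
    pressure to fabricate values just to fill the template.
    """
    if not ocr_ok:
        return CatalogResult(status="insufficient_data",
                             reason="label text unreadable")
    return CatalogResult(status="ok", artist=artist,
                         catalog_number=catalog_number)

print(finalize(False).status)  # → "insufficient_data"
```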

Why this matters

This is worse than an obvious error:

  • Outputs are confident
  • Corruption is silent
  • Data appears internally consistent

For cataloging, archival, or inventory workflows, this makes the Gem unreliable by default.

How to reproduce (consistently)

1. Create a Gem with OCR instructions and an attached CSV

2. Upload an image that is real but visually ambiguous or conflicts with priors

3. Observe:

  • No admission of an unreadable image
  • A hallucinated identification
  • Mixed fields from different CSV rows

Repeatability is high.

What would actually mitigate this (engineering-side)

  • Vision-first gating: if OCR confidence is below threshold, block the CSV lookup
  • Row-atomic CSV access: select one row before extracting fields
  • Explicit “insufficient data” success state: reward uncertainty instead of penalizing it
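A fourth, purely downstream mitigation: validate model output after the fact by checking that every field it returns co-occurs in a single CSV row, and flag stitched records instead of storing them. A sketch with assumed column names and placeholder data:

```python
import csv
import io

def fields_from_one_row(csv_file, claimed):
    """True iff some single row contains every claimed field value.

    `claimed` is a dict like {"Artist": ..., "CatalogNumber": ...}.
    A record stitched from two rows (Row A artist + Row B catalog
    number) fails this check and can be rejected before it corrupts
    the catalog.
    """
    for row in csv.DictReader(csv_file):
        if all(row.get(k) == v for k, v in claimed.items()):
            return True
    return False

CSV_TEXT = """Artist,Label,CatalogNumber
Guns N' Roses,Geffen,GEF 24415
Slayer,Def American,924 131-1
"""

stitched = {"Artist": "Guns N' Roses", "CatalogNumber": "924 131-1"}
print(fields_from_one_row(io.StringIO(CSV_TEXT), stitched))  # → False
```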

Prompt engineering alone won’t reliably fix this.

So if you’re running into this, you’re not missing anything. You’ve hit a multimodal edge case where uncertainty in vision causes the model to borrow confidence from the wrong subsystem.

Hope this saves you some time and second-guessing.

~ :vulcan_salute: Cmd.Proton & The Lab Crew :test_tube: