We are generally happy with Gemini-2.5-Flash. However, the generated content sometimes contains hallucinations, and since generation is not deterministic, there is no obvious way to reproduce them.
Is there interest in having these hallucinations and false positives reported? If so, what is the preferred way of reporting them? The text inputs can sometimes be large, but we are happy to share them publicly.
Google offers the top-p parameter (nucleus sampling), which can make generations more truthful and, at the lowest settings, approach determinism.
Background: Every token generated by a language model is sampled from a probability distribution computed over all possible tokens; it is not a singular choice the AI consciously made.
So if I ask “True or False: alligators can sneeze.”
In this image of logit probabilities (logprobs), translated to probability space in the tooltip, we can see that default open sampling might produce “True” as the first token about 5.3 percent of the time. Everything generated after that would be supporting language for that particular branch. You might call that a hallucination - or just wrong. Or maybe only 95% wrong, since I don’t know alligator biology.
top-p can reduce such occurrences. The default of top-p: 0.95 (95% cumulative probability) that AI Studio gives you for these models wouldn’t reduce the appearance of “True”; it is still included in the set. However, top-p 0.50 would sample only from tokens that fall within the top 50% of the distribution - in this case, there would be only one choice.
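To make the mechanism concrete, here is a rough Python sketch of that top-p filtering. The probability numbers are illustrative only, loosely read off the screenshot above, not real model output.

```python
# Rough sketch of nucleus (top-p) filtering over first-token probabilities.
# The numbers below are illustrative, loosely based on the screenshot above.

def nucleus_filter(token_probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

first_token = {"False": 0.94, "True": 0.053, "It": 0.004, "As": 0.003}

print(nucleus_filter(first_token, top_p=0.95))  # {'False': 0.94, 'True': 0.053} - "True" still possible
print(nucleus_filter(first_token, top_p=0.50))  # {'False': 0.94} - only one choice left
```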
True hallucinations, where the AI goes into areas it cannot possibly answer, are usually countered by denial post-training, where the AI might produce, “I’m sorry, but as an AI language model, I cannot tell you tomorrow’s winning lottery numbers,” instead of going along with your request.
Reporting: I don’t know of any avenue for reporting, or the usefulness of reporting, “the AI produced this mistruth”, because there are a billion never-before-asked questions that could be considered such stimulus, and an AI that answers better in general (for the budget) is everyone’s goal.
Adjacent entertainment: a new, awkward thinking summary - and Gemini (compared to the other provider’s AI) answered True.
The answer above by Jay is excellent. Special thanks for the humorous addendum (“exploring nasal anatomy” is top-notch hilarious). I was only going to add that, when using AI Studio, you can still use the “thumbs up” and “thumbs down” buttons - at least I do when the model is egregiously hallucinating (and then check the “factually incorrect” checkbox in the popup). I think that’s useful, not for the current version of the model, obviously, but for the next one.
I am not using AI Studio, so unfortunately I can’t use the thumbs feature.
I guess I’ll try reducing topP (and possibly temperature) to see if it improves things. The reason I haven’t tried this before is that I generally want to use the default values, and I was worried it would reduce the model’s creativity and make it less useful.
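For reference, this is roughly how I plan to pass the lower values, assuming the google-genai Python SDK; the model name, file path, and exact numbers are placeholders I made up:

```python
# Sketch of lowering top_p / temperature for the "find typos" call,
# assuming the google-genai Python SDK; values and paths are placeholders.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")   # placeholder key
diff_text = Path("changes.diff").read_text()    # the git diff to check

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find typos. Reply only with snippets from this diff:\n" + diff_text,
    config=types.GenerateContentConfig(
        temperature=0.2,  # below the default, to reduce run-to-run variation
        top_p=0.5,        # sample only from the most probable tokens
    ),
)
print(response.text)
```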
To give some more background on my task: the LLM is provided with the prompt (“find typos”) and the full text to evaluate (a git diff), and is supposed to reply only with snippets from that text. (Ref: prompt)
Normally this works, but in some cases it comes up with stuff like:
Here the provided text contains neither uint66_t nor a zero-width space. (My bet is that these get made up during the thinking process, but I haven’t confirmed this.)
Thinking about an alternative solution myself, I could probably use the structured JSON response format and then double-check the response, filtering out any suggestions that do not relate to the provided text. However, I’d be more excited if the LLM could handle this itself (but maybe I am dreaming too much).
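For illustration, a rough sketch of that double-check, assuming the google-genai SDK’s structured output support (response_mime_type / response_schema); the Suggestion schema and its field names are made up:

```python
# Sketch of the JSON + post-filter idea; the Suggestion schema is hypothetical.
import json
from pathlib import Path

from google import genai
from google.genai import types
from pydantic import BaseModel


class Suggestion(BaseModel):
    snippet: str      # exact text the model claims to have copied from the diff
    correction: str   # proposed replacement


client = genai.Client(api_key="YOUR_API_KEY")
diff_text = Path("changes.diff").read_text()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find typos. Reply only with snippets from this diff:\n" + diff_text,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[Suggestion],
    ),
)

suggestions = [Suggestion(**item) for item in json.loads(response.text)]

# Drop anything the model invented: a snippet that is not literally in the diff
# (e.g. "uint66_t" or a zero-width space) gets filtered out here.
valid = [s for s in suggestions if s.snippet in diff_text]
```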