We are generally happy with Gemini-2.5-Flash. However, the generated content sometimes contains hallucinations, and since generation is not deterministic, there is no obvious way to reproduce them.
Is there interest in having these hallucinations and false positives reported? If so, what is the preferred way of reporting them? The text inputs can sometimes be large, but we are happy to share them publicly.
Google offers the top-p parameter (nucleus sampling), which can make generations more truthful and, at the lowest settings, approach determinism.
Background: Every token generated by a language model is sampled from a probability distribution computed over all possible tokens; it is not a singular choice the AI consciously made.
So if I ask “True or False: alligators can sneeze.”
In this image of logit probabilities (logprobs), translated to probability space in the tooltip, we can see that default open sampling might produce “True” as the first token about 5.3 percent of the time. Everything generated after that would be supporting language for that particular branch. You might call that a hallucination - or just wrong. Or maybe only 95% wrong, since I don’t know alligator biology.
top-p can reduce such occurrences. The default of top-p: 0.95 (95% cumulative probability) that AI Studio gives you for these models wouldn’t reduce the appearance of “True”; it is still included in the set. However, top-p 0.50 would sample only from tokens that fall within the top 50% of the distribution - in this case, there would be only one choice.
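To make the mechanism concrete, here is a rough Python sketch of that top-p filtering. The probability numbers are illustrative only, loosely read off the screenshot above, not real model output.

```python
# Rough sketch of nucleus (top-p) filtering over first-token probabilities.
# The numbers below are illustrative, loosely based on the screenshot above.

def nucleus_filter(token_probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of most-probable tokens whose cumulative probability reaches top_p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept

first_token = {"False": 0.94, "True": 0.053, "It": 0.004, "As": 0.003}

print(nucleus_filter(first_token, top_p=0.95))  # {'False': 0.94, 'True': 0.053} - "True" still possible
print(nucleus_filter(first_token, top_p=0.50))  # {'False': 0.94} - only one choice left
```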
True hallucinations, where the AI goes into areas it cannot possibly answer, are usually countered by denial post-training, where the AI might produce, “I’m sorry, but as an AI language model, I cannot tell you tomorrow’s winning lottery numbers,” instead of going along with your request.
Reporting: I don’t know of any avenue for reporting, or the usefulness of reporting, “the AI produced this mistruth”, because there are a billion never-before-asked questions that could be considered such stimulus, and an AI that answers better in general (for the budget) is everyone’s goal.
Adjacent entertainment: a new, awkward thinking summary - and Gemini (compared to the other provider’s AI) answered True.
The answer above by Jay is excellent. Special thanks for the humorous addendum (“exploring nasal anatomy” is top-notch hilarious). I was only going to add that, when using AI Studio, you can still use the “thumbs up” and “thumbs down” buttons - at least I do when the model is egregiously hallucinating (and then check the “factually incorrect” checkbox in the popup). I think that’s useful, not for the current version of the model, obviously, but for the next one.
I am not using AI Studio, so unfortunately I can’t use the thumbs feature.
I guess I’ll try reducing topP (and possibly temperature) to see if it improves things. The reason I haven’t tried this before is that I generally want to use the default values, and I was worried it would reduce the model’s creativity and make it less useful.
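For reference, this is roughly how I plan to pass the lower values, assuming the google-genai Python SDK; the model name, file path, and exact numbers are placeholders I made up:

```python
# Sketch of lowering top_p / temperature for the "find typos" call,
# assuming the google-genai Python SDK; values and paths are placeholders.
from pathlib import Path

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")   # placeholder key
diff_text = Path("changes.diff").read_text()    # the git diff to check

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find typos. Reply only with snippets from this diff:\n" + diff_text,
    config=types.GenerateContentConfig(
        temperature=0.2,  # below the default, to reduce run-to-run variation
        top_p=0.5,        # sample only from the most probable tokens
    ),
)
print(response.text)
```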
To give some more background on my task: the LLM is provided with the prompt (“find typos”) and the full text to evaluate (a git diff), and is supposed to reply only with snippets from that text. (Ref: prompt)
Normally this works, but in some cases it comes up with stuff like:
Here the provided text contains neither uint66_t nor a zero-width space. (My bet is that these get made up during the thinking process, but I haven’t confirmed this.)
Thinking about an alternative solution myself, I could probably use the structured JSON response format and then double-check the response, filtering out any suggestions that do not relate to the provided text. However, I’d be more excited if the LLM could handle this itself (but maybe I am dreaming too much).
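For illustration, a rough sketch of that double-check, assuming the google-genai SDK’s structured output support (response_mime_type / response_schema); the Suggestion schema and its field names are made up:

```python
# Sketch of the JSON + post-filter idea; the Suggestion schema is hypothetical.
import json
from pathlib import Path

from google import genai
from google.genai import types
from pydantic import BaseModel


class Suggestion(BaseModel):
    snippet: str      # exact text the model claims to have copied from the diff
    correction: str   # proposed replacement


client = genai.Client(api_key="YOUR_API_KEY")
diff_text = Path("changes.diff").read_text()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Find typos. Reply only with snippets from this diff:\n" + diff_text,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=list[Suggestion],
    ),
)

suggestions = [Suggestion(**item) for item in json.loads(response.text)]

# Drop anything the model invented: a snippet that is not literally in the diff
# (e.g. "uint66_t" or a zero-width space) gets filtered out here.
valid = [s for s in suggestions if s.snippet in diff_text]
```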