Feedback on Gemini Exp-1206: Hallucinated Language Errors in Essay Scoring

Hello,

I understand that Gemini Exp-1206 is an experimental model, and I appreciate the opportunity to provide feedback on its performance via this forum. I’m currently working on an AI in Education project exploring the use of large language models for automated essay scoring. In this application, large language models are essentially asked to scan texts for language errors and to suggest corrections for them.

While experimenting with Exp-1206, I’ve observed a recurring issue where the model hallucinates language errors. It sometimes flags grammatically and linguistically correct sentences as containing errors and then suggests the same sentence as a correction.
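Such no-op corrections can be filtered out automatically before scoring. The following is a minimal sketch, assuming the model's feedback has already been parsed into pairs of flagged sentence and suggested correction. The function names and the two example sentences are made up for illustration and do not come from the original data.

```python
import unicodedata


def normalize(sentence: str) -> str:
    """Normalize the Unicode form and collapse whitespace so trivial differences are ignored."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def find_noop_corrections(pairs):
    """Return the (original, suggestion) pairs whose suggestion equals the original."""
    return [(orig, sugg) for orig, sugg in pairs if normalize(orig) == normalize(sugg)]


# Hypothetical parsed model output: (flagged sentence, suggested correction).
flagged = [
    ("Das ist ein Haus.", "Das ist ein Haus."),          # suggestion repeats the input
    ("Ich habe stats recht.", "Ich habe stets recht."),  # genuine correction
]

noops = find_noop_corrections(flagged)
print(noops)
```

Pairs returned by `find_noop_corrections` are candidates for hallucinated errors and could be dropped or logged before any score is computed.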

This behavior is problematic for essay scoring as it can lead to inaccurate assessments of student writing and hinder the model’s ability to provide constructive feedback. In my experience with other Gemini models such as 1.5-Pro or Gemini 2 Flash, I haven’t encountered this issue, suggesting it might be specific to Exp-1206.

If it is helpful for improving the model, I will try to provide some reproducible, artificial examples as illustration. However, I cannot share the original data where I encountered this problem.

Kind regards,
Rudolf Debelak

Thank you for the feedback. Please provide us with the reproducible steps and examples, and we will pass the feedback to the product team.

Dear @camadi,

Thank you for the opportunity to provide further examples. I first had to discuss sharing this feedback with Google with my collaborators, which is why my reply is late. They asked me to provide a made-up prompt and example text rather than the data in which we observed the problem.

I think the following prompt, which I wrote entirely myself, illustrates the problem well. It is relevant because users could send similar prompts in applications such as Google Docs or Gmail to check the correctness of texts. As with my original examples, I used the default hyperparameters of Gemini Exp-1206 and a German prompt.

The prompt is the following text, which is a basic request for checking the spelling and grammar and for providing corrections, followed by a brief text in which a person expresses admiration for their father, a famous elderly painter:

"Bitte beurteile folgenden Text und gib Feedback in Bezug auf Rechtschreibung und Grammatik. Falls du sprachliche Fehler findest, schlage Verbesserungen vor:

    'Mein Vater ist ein bekannter Maler, der zahlreiche Preise gewonnen hat, und mir durch seine hohe Leidenschaft stats als Vorbild diente. Obwohl er eher zjs zurückgezogen lebe, ist er Mentor zahlreicher Künstlerinnen und Künstler, was ich sehr bewundere.
Es ist nicht leicht, als Künstler ein Leben in Wohl stand zu führen, allerdings hat er durch seine Aufopferung seiner Frau, meinen Geschwistern und mir ein angenehmes Leben ermöglicht. Obwohl mir das als Kind nie bewusst war, habe ich später gemerkt, wie sehr diehse Opfer ihn gekostet haben.

Trotz der vielen Opfer und der Unsicherheit ist sein Leben erfüllt. Ich würde sogar sagen , dass sein Leben , das der Kunst und Malerei gewidmet ist , überaus angenehm für ihn ist . Kunst und Malerei sind seine Leidenschaft und das sein ganzes Leben lang.'"
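For illustration, a request of this kind could be sent roughly as follows. This is a minimal sketch and not the original script: it assumes the `google-generativeai` Python package, the model identifier `gemini-exp-1206`, and an API key stored in an environment variable named `GEMINI_API_KEY` (a hypothetical name). No generation settings are passed, so the defaults apply.

```python
import os

# Instruction taken verbatim from the prompt above.
INSTRUCTION = (
    "Bitte beurteile folgenden Text und gib Feedback in Bezug auf "
    "Rechtschreibung und Grammatik. Falls du sprachliche Fehler findest, "
    "schlage Verbesserungen vor:"
)


def build_prompt(essay: str) -> str:
    """Combine the German instruction with the quoted text to be checked."""
    return f"{INSTRUCTION}\n\n'{essay.strip()}'"


def check_text(essay: str, model_name: str = "gemini-exp-1206") -> str:
    """Send the prompt with default generation settings and return the reply text."""
    # Imported lazily so build_prompt can be used without the SDK installed.
    import google.generativeai as genai

    # Assumption: the key is stored under this (hypothetical) variable name.
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel(model_name)  # no generation_config: defaults apply
    return model.generate_content(build_prompt(essay)).text


# Example call (requires network access and a valid API key):
# print(check_text(essay_text))
```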

Based on this prompt, Gemini Exp-1206 provided extensive feedback that contained several incorrect points, such as:

  • It wants to add another word to “zjs”. “zjs” is not a German word; it serves as a placeholder for a typo. The sentence is correct without this word.
  • It wants to add a comma after “zurückgezogen lebe”, although the text already contains one.
  • It states that the comma after “Aufopferung” is unnecessary, although there is no comma at this point.
  • It makes incorrect statements regarding the placement of the commas in the last two sentences.
  • It incorrectly states that a verb is missing in the final sentence.
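Some of these claims can be checked mechanically against the input text. The sketch below is one possible approach, not a complete validator: it tests only whether a comma follows a given phrase and uses that to see whether an "add a comma" or "remove a comma" claim is even consistent with the text. The helper names and the `action` labels are hypothetical.

```python
import re

# Excerpt from the example text in the prompt above.
essay_excerpt = (
    "Obwohl er eher zjs zurückgezogen lebe, ist er Mentor zahlreicher "
    "Künstlerinnen und Künstler, was ich sehr bewundere. "
    "Es ist nicht leicht, als Künstler ein Leben in Wohl stand zu führen, "
    "allerdings hat er durch seine Aufopferung seiner Frau, meinen "
    "Geschwistern und mir ein angenehmes Leben ermöglicht."
)


def has_comma_after(text: str, phrase: str) -> bool:
    """True if some occurrence of `phrase` is followed by a comma (optional whitespace allowed)."""
    return re.search(re.escape(phrase) + r"\s*,", text) is not None


def claim_is_consistent(text: str, action: str, phrase: str) -> bool:
    """An 'add' claim needs a missing comma; a 'remove' claim needs an existing one."""
    present = has_comma_after(text, phrase)
    return not present if action == "add" else present


# The two comma claims from the list above contradict the text.
print(claim_is_consistent(essay_excerpt, "add", "zurückgezogen lebe"))
print(claim_is_consistent(essay_excerpt, "remove", "Aufopferung"))
```

A claim that fails this check refers to a comma situation that does not exist in the text, so it can be discarded without any linguistic judgment.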

Of course, many of the points raised by the model were completely valid, but it seems necessary to check the model’s output in similar applications.

In other texts, which I cannot share, this model also made up spelling errors in a similar fashion. I suspect that this model might show a general weakness in checking the spelling and grammar of texts, which might be relevant for specific application cases.

If I can support you further, for instance by sharing the Python code for generating this output or Gemini Exp-1206’s complete (German) output, please let me know.