I’ve thrown roughly 4,000 comparable requests at the various Gemini models and found that every Gemini model released after Gemini-1.5-pro-exp-0827 has been terrible at producing structured output consistently, and does not follow instructions well. Terrible at timestamps, too.
The following could provide structured output consistently - without fail:
- Gemini-1.5-pro-001
- Gemini-1.5-pro-exp-0827 (since deprecated)
Every single one of the following models was/is unreliable:
- Gemini-exp-1121
- Gemini-exp-1206
- Gemini-2.0-flash-exp
- Gemini-2.0-flash-exp-thinking-1219
- Gemini-2.0-flash-exp-thinking-0121
- Gemini-2.0-pro-exp-0205
The “thinking” models have improved the output, but I’m still regularly getting really poor performance.
E.g. if I ask for a really simple output with XML tags, it will output 20 records correctly, e.g.:
<topictitle>Topic title</topictitle>
00:00-01:23
And then it will randomly alter the closing XML tags on, for example, the 21st record, e.g.:
Topic title - XML closing tag misspelled
00:00-01:23 - XML closing tag mismatch
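Since the failures are this mechanical (a misspelled or truncated closing tag on one record out of dozens), they are at least easy to catch in code. Here's a minimal validation sketch - the `<topictitle>` tag comes from my example above, but the exact record format and the helper name are my own assumptions:

```python
import re

# Hypothetical record format: an XML-tagged title line followed by a bare
# timestamp line, e.g. "<topictitle>Intro</topictitle>" then "00:00-01:23".
TITLE_RE = re.compile(r"^<(\w+)>(.*?)</(\w+)>$")
TIME_RE = re.compile(r"^\d{2}:\d{2}-\d{2}:\d{2}$")

def validate_records(lines):
    """Return a list of (line_number, reason) tuples for malformed lines."""
    errors = []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        m = TITLE_RE.match(line)
        if m:
            open_tag, _, close_tag = m.groups()
            if open_tag != close_tag:
                # The "21st record" failure mode: <topictitle>...</topictitl>
                errors.append((i, f"closing tag mismatch: <{open_tag}> vs </{close_tag}>"))
        elif line.startswith("<"):
            # Tag never closed properly, e.g. missing the final ">".
            errors.append((i, "malformed or unterminated tag"))
        elif not TIME_RE.match(line):
            errors.append((i, "unrecognized line"))
    return errors
```

An empty return means the batch parsed cleanly; anything else tells you exactly which record the model mangled.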
I have:
- given the model examples of what is correct
- given the model specific examples of what is not correct (including the exact mistakes it currently outputs)
- changed the wording of the prompt, reiterated the rules, and asked the model to think step by step (where it will tell me it will check for common mistakes, but still makes them)
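Since none of the prompting fixes stuck, the pragmatic fallback (not something I'd have to do if the model just followed instructions) is to validate the output programmatically and retry with the concrete errors fed back in. A sketch, where `call_model` and `validate` are placeholders for a real API call and any checker like the one above:

```python
def generate_with_validation(call_model, prompt, validate, max_retries=3):
    """Call the model, validate its output, and retry on failure.

    call_model: any function prompt -> text (stands in for a real API call).
    validate: returns a list of error strings; an empty list means success.
    """
    last_errors = []
    for _ in range(max_retries):
        text = call_model(prompt)
        last_errors = validate(text)
        if not last_errors:
            return text
        # Feed the concrete mistakes back so the retry is targeted,
        # instead of hoping "think step by step" catches them this time.
        prompt = prompt + "\nYour previous output had these problems:\n" \
            + "\n".join(last_errors)
    raise ValueError(f"no valid output after {max_retries} tries: {last_errors}")
```

It's a band-aid, and it multiplies cost per request, but it at least turns "randomly mangles record 21" into a recoverable error instead of silent corruption.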
Probably time for me to move on, I’ve given Google enough loyalty - I doubt there’s anyone reading the feedback or responding to it anyway.