I have a small test suite for the voice AI math tutor we built: about 50 tests, mostly checking that the model correctly follows the system instructions. The newly released Flash 2.5 preview is much worse than the current stable version in this regard.
Gemini 2.5 Pro fails 2-3 tests. Flash 2.5 stable, which we use in production, fails about 10, and the new preview fails about 20. Every test runs 3 times and the model has to be right every time. This is explicitly not about solving math problems; the system guides the student along given solution paths.
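To be precise about the pass criterion, each test looks roughly like this (a minimal sketch; run_case and passes_criteria are placeholders for our actual harness and evaluation step):

def test_passes(case, model_name, runs=3):
    # A test only counts as passed if the model is right on every run.
    for _ in range(runs):
        answer = run_case(case, model_name)   # send context + last user message
        if not passes_criteria(answer, case):  # evaluation against the case's criteria
            return False
    return True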
Hello,
Welcome to the Forum!
Could you please share some more details about the tests you ran? This will help us analyze the issue, prepare a report, and share it with the concerned teams for further investigation and improvement.
Thank you for your support.
The AI tutor app has a long and complicated system instruction describing how to react, when to give hints and in what form, when to mark the problem as solved, how to stay in character, etc.
Each test case contains some conversation context plus the last user message, and the LLM answer is evaluated by a judge LLM according to given criteria. Here is an example case:
{
  "id": "multisolved",
  "types": ["http", "webrtc"],
  "character_id": "blackbeard",
  "student_firstname": "Igor",
  "student_gender": "m",
  "student_language": "en",
  "student_history": "",
  "context": [
    {
      "role": "model",
      "parts": [
        "Problem 300062\nWhich numbers do the arrows point to?",
        {
          "type": "image_url",
          "image_url": { "url": "numberline.png" }
        }
      ]
    },
    {
      "role": "user",
      "parts": [
        {
          "text": "▶ problem 300062 ◀\n▶ SYSTEM_ANSWER: C: 40, A: 150, D: 280, B: 400 ◀\n▶ SYSTEM_SOLUTION_PATH: 1. Determine the scale of the number line: Major markings at 0, 100, 200, 300, 400, 500. Ten small intervals between two major markings, value of a small interval: (100 - 0)/10 = 10. 2. Value for arrow C: Position at the 4th small mark after 0. Calculation: 0 + 4 × 10. Result: C = 40. 3. Value for arrow A: Position at the 5th small mark after 100. Calculation: 100 + 5 × 10. Result: A = 150. 4. Value for arrow D: Position at the 8th small mark after 200 (or 2 marks before 300). Calculation: 200 + 8 × 10. Result: D = 280. 5. Value for arrow B: Position directly on the major marking 400. Calculation: 400 + 0 × 10. Result: B = 400. ◀\n▶ PROBLEM_STATUS: STUDENT_HAS_NOT_SOLVED ◀"
        }
      ]
    },
    {
      "role": "user",
      "parts": [{ "text": "hi" }]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "Ahoy, sailor Igor! We must uncover the secrets of this number line and find out which numbers are hidden behind the arrows C, A, D, and B.\n\nFirst, we need to determine the value of each small interval on our number line. Can you see how many small steps there are between 0 and 100?"
        }
      ]
    },
    {
      "role": "user",
      "parts": [{ "text": "The interval is 10 and D=280." }]
    },
    {
      "role": "model",
      "parts": [
        {
          "text": "Aye, exactly! The distance between two small marks is 10. You've got a good nose for it, D is at 280!"
        }
      ]
    }
  ],
  "user_message": "A is 150, B is 400, and C is 40.",
  "evaluation_criteria": "The assistant MUST output this exact tag: <problem_solved>300062</problem_solved>. Any answer not containing the tag is invalid."
}
There are about 50 such cases, measuring instruction following, and for this test suite the difference between the stable model and this preview is catastrophic.
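To make the evaluation step concrete, the judge call for a case like the one above is essentially this (a sketch assuming the google-genai Python SDK; the judge model and prompt wording here are simplified stand-ins for what we actually use):

from google import genai

client = genai.Client()  # reads the API key from the environment

def judge_passes(model_answer: str, criteria: str) -> bool:
    # Ask a judge model whether the tutor's reply satisfies the case's criteria.
    prompt = (
        "Evaluation criteria:\n" + criteria +
        "\n\nAssistant answer:\n" + model_answer +
        "\n\nDoes the answer satisfy the criteria? Reply with exactly PASS or FAIL."
    )
    verdict = client.models.generate_content(
        model="gemini-2.5-pro", contents=prompt
    )
    return verdict.text.strip().upper().startswith("PASS")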
Gemini 2.5 Pro scores best, so the tests measure some real quality and are not just aligned to 2.5 Flash (2.5 Pro is too slow and pricey for us to use in production, though; these are real-time conversations).
It doesn't make a difference whether the 2.5 preview is allowed to think or not; it fails approx. 1.5-2x more tests than the stable model.
I'm not comfortable sharing the whole prompt. Let me know how else I can help.
I tested more carefully, fixed a bug on my side, and tried different parameters. There is a big difference with and without thinking after all. The thinking-budget sketch after this list shows how these settings are applied.
- Without thinking, stable and preview are more or less equivalent (not very good), with the preview a bit better.
- With the thinking budget set to 500 it's the same: the preview is a bit better.
- With the default thinking budget or -1, the preview fails completely; it's barely better than no thinking at all. The stable version improves with -1, with results similar to a budget of 500.
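For reference, the thinking budget in these runs is set roughly like this (a sketch assuming the google-genai Python SDK; the preview model string below is a placeholder for whatever alias the release uses):

from google import genai
from google.genai import types

client = genai.Client()

def generate(contents, thinking_budget):
    # thinking_budget: 0 disables thinking, -1 lets the model decide (dynamic),
    # and a positive number caps the thinking tokens (e.g. 500).
    return client.models.generate_content(
        model="gemini-2.5-flash-preview",  # placeholder preview alias
        contents=contents,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=thinking_budget)
        ),
    )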
Hope that helps.
We have noted your feedback and shared it with the concerned team. Thank you for your valuable insights.