Inflated and misleading benchmarks for 2.5 Pro 0605?

I’ve noticed that the 0605 model tends to think less and generate shorter output than the previous 0506 model.

Has anyone noticed similar issues?


Example: when I asked both the new model and the old model to create a comprehensive reading note for a given article, the old model produced much more detailed content, while the new one tended to skip sections and citations.


In my use cases especially (academic text comprehension, summarization, legal reasoning, etc.), 0605 seems to perform worse than 0506. For example, 0605 loses information from the original text in the summaries it produces and cites less detailed references; when doing legal analysis, it misses legal facts that trigger liability.


Did you check the reasoning tab? You can also adjust the responses, but keep in mind it’s a thinking model, whereas the previous version focused more intently on response generation.

Thank you for the suggestion. I did compare the reasoning tab and reasoning summary for both 0506 and 0605.

They have similar reasoning approaches, but 0605 just gives worse responses.

What confuses me is this: if 0605 truly ‘thinks’ better, why would it miss legal facts in its analysis and omit sections in its text summaries?

For now, I’m just waiting for LegalBench to publish new benchmark results and see…

0605’s performance regression seems to have been confirmed by:
