On the Gemini release pages, GPQA-Diamond performance is reported as 82.8% and 86.4%, respectively.
However, both a) on ArtificialAnalysis and b) in my own benchmarking with OpenAI's simple-evals harness, performance is 3-4 points lower.
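For reference, here is roughly how my runs are set up (a minimal sketch: the class and argument names follow the public simple-evals repo as I understand it, the import paths assume the cloned repo is importable as `simple_evals`, and the model name is a placeholder for the model under test):

```python
# Minimal sketch of a GPQA-Diamond run with OpenAI's simple-evals
# (https://github.com/openai/simple-evals). Assumptions: the cloned repo is
# importable as `simple_evals`, and the model name below is a placeholder.
from simple_evals.gpqa_eval import GPQAEval
from simple_evals.sampler.chat_completion_sampler import ChatCompletionSampler

# Any OpenAI-compatible chat endpoint works here; credentials and base URL
# come from the usual OpenAI client environment variables.
sampler = ChatCompletionSampler(model="<model-under-test>")

# "diamond" is the GPQA subset reported above; n_repeats=4 mirrors (what I
# believe is) the harness default.
gpqa = GPQAEval(variant="diamond", n_repeats=4)
result = gpqa(sampler)

print(f"GPQA-Diamond accuracy: {result.score:.3f}")
```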
Are the details of how the “official” Google benchmarks were run published anywhere? Presumably they use a different prompt, access to tools, best-of-n, … I would love to be able to (approximately) reproduce the “official” numbers!
Thanks!