GPQA-Diamond Benchmark Results

On the Gemini release pages, GPQA-Diamond performance is reported as 82.8% and 86.4%, respectively.

However, in both a) ArtificialAnalysis's published results and b) my own benchmarking using OpenAI's simple-evals harness, performance comes out 3-4 points lower.
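
For context, here is a minimal sketch of the kind of run I mean, assuming the current simple-evals repo layout (`GPQAEval` in `gpqa_eval.py`, `ChatCompletionSampler` in `sampler/chat_completion_sampler.py`) and Gemini's OpenAI-compatible endpoint. The model string, temperature, and max_tokens are placeholders, not the exact settings from either my runs or the release pages:

```python
# Sketch: GPQA-Diamond via OpenAI's simple-evals
# (https://github.com/openai/simple-evals), run from the repo root,
# pointed at Gemini's OpenAI-compatible endpoint.
import os

# Must be set before the sampler is created, since it builds an
# OpenAI() client at init time (the SDK reads these env vars).
os.environ["OPENAI_BASE_URL"] = (
    "https://generativelanguage.googleapis.com/v1beta/openai/"
)
os.environ["OPENAI_API_KEY"] = os.environ["GEMINI_API_KEY"]

from gpqa_eval import GPQAEval
from sampler.chat_completion_sampler import ChatCompletionSampler

sampler = ChatCompletionSampler(
    model="gemini-2.5-pro",  # placeholder: whichever checkpoint you test
    temperature=0.5,         # simple-evals default; official runs may differ
    max_tokens=2048,
)

# n_repeats=4 and variant="diamond" are the simple-evals defaults for GPQA.
gpqa = GPQAEval(n_repeats=4, variant="diamond")
result = gpqa(sampler)
print(f"GPQA-Diamond accuracy: {result.score:.3f}")
```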

Are there any details available on how the “official” Google benchmarks were run? Presumably they use a different prompt, access to tools, best-of-n, … Would love to be able to (approximately) reproduce the “official” numbers!

Thanks!