On the Gemini release pages, GPQA-Diamond performance is reported as 82.8% and 86.4%, respectively.
However, both a) on ArtificialAnalysis and b) in my own benchmarking with OpenAI's simple-evals harness, performance is 3-4 points lower.
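For reference, here is roughly how my runs are set up (a minimal sketch: the class and argument names follow the public simple-evals repo as I understand it, the import paths assume the cloned repo is importable as `simple_evals`, and the model name is a placeholder for the model under test):

```python
# Minimal sketch of a GPQA-Diamond run with OpenAI's simple-evals
# (https://github.com/openai/simple-evals). Assumptions: the cloned repo is
# importable as `simple_evals`, and the model name below is a placeholder.
from simple_evals.gpqa_eval import GPQAEval
from simple_evals.sampler.chat_completion_sampler import ChatCompletionSampler

# Any OpenAI-compatible chat endpoint works here; credentials and base URL
# come from the usual OpenAI client environment variables.
sampler = ChatCompletionSampler(model="<model-under-test>")

# "diamond" is the GPQA subset reported above; n_repeats=4 mirrors (what I
# believe is) the harness default.
gpqa = GPQAEval(variant="diamond", n_repeats=4)
result = gpqa(sampler)

print(f"GPQA-Diamond accuracy: {result.score:.3f}")
```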
Are the details of how the “official” Google benchmarks were run published anywhere? Presumably they use a different prompt, access to tools, best-of-n, … I would love to be able to (approximately) reproduce the “official” numbers!
Thanks!