AI can solve olympiad math but fails at research math — and now we can measure the gap

What happened

Researchers created a benchmark, Riemann-Bench, of 25 genuinely hard math problems (the kind that took PhD mathematicians weeks to solve) to test whether AI systems that ace math competitions can handle real research-level work. Every model tested scored below 10%, exposing a wide gap between competition math and the kind of reasoning mathematicians actually do.

Why it matters

For three years, AI labs have claimed their systems reached gold-medal performance on the International Mathematical Olympiad, generating headlines about AI solving 'hard math.' But olympiad problems are a narrow slice: they reward clever tricks over deep theory, and they come from a limited domain that AI can train on. The benchmark suggests what is behind those results: the models are pattern-matching on a constrained problem set, not reasoning at research level. The practical effect is stark. Every frontier model tested scored under 10% on problems that are hard in the way real mathematics is hard. This matters because it cuts through the hype. If you hear 'AI solved the IMO,' you now know what that actually means, and what it doesn't.

The signal

Watch whether labs start publishing attempted solutions to Riemann-Bench problems, and whether those attempts reveal what kinds of mathematical reasoning the models are actually missing. The alternative is that labs simply stop claiming superiority on mathematical reasoning.