Researchers build a system that finds where AI fails at math, then generates harder problems to expose those failures

What happened

A new tool uses AI itself to identify the specific math concepts where large language models fail, then automatically generates harder problems that target those weaknesses. This means researchers can now create custom benchmarks that don't require manual work, adapt to new models as they improve, and test capabilities beyond math.
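The two-stage loop described above (find the weak concepts, then generate problems aimed at them) can be illustrated with a small Python sketch. It is not the researchers' implementation: the function names, the 0.5 failure threshold, the sample results, and the `template_generator` stand-in are all assumptions made for illustration, and a real pipeline would call a language model where the stub returns a placeholder string.

```python
from collections import defaultdict


def failure_rates(results):
    """Map each concept to the fraction of its problems the model failed.

    `results` is an iterable of (concept, solved) pairs; this shape is an
    assumption for the sketch, not the tool's real data format.
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for concept, solved in results:
        attempts[concept] += 1
        if not solved:
            failures[concept] += 1
    return {c: failures[c] / attempts[c] for c in attempts}


def weakest_concepts(results, threshold=0.5):
    """Return concepts whose failure rate is at or above the threshold,
    worst first (ties broken alphabetically)."""
    rates = failure_rates(results)
    weak = [c for c, rate in rates.items() if rate >= threshold]
    return sorted(weak, key=lambda c: (-rates[c], c))


def template_generator(concept, index):
    """Hypothetical stand-in for an AI problem generator."""
    return f"[{concept}] harder variant #{index + 1}"


def build_benchmark(results, per_concept, generator, threshold=0.5):
    """Generate `per_concept` new problems for every weak concept."""
    return {
        concept: [generator(concept, i) for i in range(per_concept)]
        for concept in weakest_concepts(results, threshold)
    }


# Made-up evaluation results: (concept tag, did the model solve it?)
results = [
    ("circle geometry", False),
    ("circle geometry", False),
    ("circle geometry", True),
    ("modular arithmetic", True),
    ("modular arithmetic", True),
    ("probability", False),
    ("probability", True),
]

benchmark = build_benchmark(results, per_concept=2, generator=template_generator)
```

With these made-up results, the benchmark contains new problems only for the concepts the model failed at least half the time. Rerunning the same loop on a newer model's results would shift the benchmark toward that model's weak spots, which is the adaptive behavior the summary describes.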

Why it matters

Until now, testing whether an AI can do math meant humans writing problems and assembling benchmarks by hand. That process is slow, cannot keep pace with new models, and does little to stop a model from simply memorizing answers that appeared in its training data. This pipeline automates the discovery of specific gaps (certain types of geometry, specific calculation patterns) and generates fresh problems that target them, which is faster than manual authoring and could extend to domains beyond math. The practical effect is that benchmarks stop being fixed artifacts that researchers assemble and become tools that keep adapting as models improve.

The signal

Watch whether this tool gets used in the next round of model evaluations, or whether researchers stick with manual benchmarking because the AI-generated problems don't capture what they care about testing.