What happened
Researchers built a system that lets AI agents run many experiments in parallel instead of one at a time, and added better ways to tell if an experiment actually worked — letting the AI keep improving even over long search periods. This means AI can now solve harder research problems on its own without getting stuck or fooled by bad measurement signals.
Why it matters
This is a demonstration that AI research agents can cross a real capability threshold — moving from toy benchmarks to sustained performance gains — by fixing structural constraints rather than just scaling up. The ablation work showing that prior 'overfitting' was evaluation noise, not memorization, is honest failure documentation that matters outside the lab.