A tool that actually runs the code to check if research papers' claims are true

What happened

Researchers built software that doesn't just read a submitted paper. It retrieves the related work, runs the released code under controlled conditions, and tests whether the main empirical claims hold up. In a test case, the tool reproduced most results but found that one paper's broader claim of performance across tasks only partially held: on one benchmark the reproduced result was 88.4%, not the 92.6% the paper claimed as the strongest baseline.
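To make the comparison concrete, here is a minimal sketch of the final step: checking a reported number against a re-executed one. It is not the researchers' implementation. The `ClaimCheck` class, its field names, and the one-point `tolerance` are hypothetical, chosen only for illustration, and the sketch treats 92.6% as the paper's reported figure and 88.4% as the reproduced one.

```python
from dataclasses import dataclass


@dataclass
class ClaimCheck:
    """One empirical claim paired with the value obtained by re-running the code.

    Hypothetical structure for illustration; not taken from the tool itself.
    """
    name: str
    claimed: float     # value reported in the paper, in percent
    reproduced: float  # value measured on re-execution, in percent

    def gap(self) -> float:
        # Positive gap: the paper reported more than re-execution produced.
        # Rounded to two decimals to avoid floating-point noise.
        return round(self.claimed - self.reproduced, 2)

    def verdict(self, tolerance: float = 1.0) -> str:
        # tolerance is an assumed threshold, in percentage points.
        return "holds" if abs(self.gap()) <= tolerance else "not reproduced"


check = ClaimCheck("benchmark result", claimed=92.6, reproduced=88.4)
print(check.name, check.gap(), check.verdict())
```

With the figures from the test case, the gap is 4.2 percentage points, well outside the assumed tolerance. A real system would presumably also account for run-to-run variance before flagging a claim.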

Why it matters

Machine learning peer review currently drowns in volume and relies on review systems that read only the submission text, which leaves them vulnerable to presentation tricks and blind to evidence buried in related work or code. The tool changes what a reviewer can actually verify. Instead of trusting a paper's numbers, a reviewer could ask the system to run the experiments and report whether the claims survive execution. The tool isn't a decision-maker. It gathers evidence, making human reviewers less dependent on what the authors chose to emphasize.

The signal

Watch whether major machine learning venues integrate this into their peer review workflows in the next 18 months, and whether adoption correlates with papers being rejected or heavily revised after their claims are re-tested.