AI agents that check their own work before acting, with a scorer trained on nearly 400K real desktop interactions

What happened

Researchers built a reward model, trained on nearly 400,000 real desktop interactions, that scores whether an AI agent's proposed next action is correct before the agent executes it. Applied to agents tackling new, unseen tasks, the scoring raised task success rates by nearly 7 percentage points, which suggests the agents catch and avoid more of their own mistakes before those mistakes cascade.
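To make the "score before you act" idea concrete, here is a minimal sketch in Python. It is not the paper's method or API: the names pick_action, toy_scorer and min_score are hypothetical, states and actions are plain strings instead of screenshots and GUI events, and the scorer is a hard-coded stand-in for a learned reward model.

```python
from typing import Callable, Optional, Sequence

# Hypothetical simplification: a real agent would pass screenshots and
# structured GUI actions, not strings.
State = str
Action = str
Scorer = Callable[[State, Action], float]


def pick_action(
    state: State,
    candidates: Sequence[Action],
    scorer: Scorer,
    min_score: float = 0.5,  # assumed cutoff, not a value from the paper
) -> Optional[Action]:
    """Return the highest-scoring candidate, or None if none clears min_score."""
    if not candidates:
        return None
    best = max(candidates, key=lambda a: scorer(state, a))
    # Refuse to act when even the best candidate looks wrong.
    return best if scorer(state, best) >= min_score else None


def toy_scorer(state: State, action: Action) -> float:
    """Stand-in for a learned reward model: favors actions mentioning 'save'."""
    return 0.9 if "save" in action else 0.2


choice = pick_action("editor open", ["click close", "click save"], toy_scorer)
```

In this toy setup the agent proposes several candidate actions, the scorer ranks them, and only the top candidate is executed. Returning None when nothing clears the threshold is one possible design choice: the agent can re-plan instead of committing to a click it already suspects is wrong.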

Why it matters

Most computer-control AI agents today fire off GUI clicks without checking whether those clicks make sense. Errors compound: one bad click breaks the next step, which breaks the one after. At root this is a measurement problem, because you can't fix what you can't score. This paper shows that a scoring model can be built from heterogeneous real-world trajectories, and that it transfers to new agents and new tasks. That is necessary infrastructure for making agentic systems reliable enough to run unattended. The open question is whether the approach generalizes beyond desktop tasks to the messier domains where agents ultimately need to operate.
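A back-of-the-envelope sketch shows why compounding errors matter so much. It rests on a simplifying assumption that is not from the paper: every step succeeds independently with the same probability, and a single bad step sinks the task. The function name chain_success and the numbers are illustrative only.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step of a task succeeds.

    Assumes independent steps with equal success probability,
    which real GUI tasks only approximate.
    """
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Illustrative numbers, not measurements: a 20-step task.
baseline = chain_success(0.95, 20)  # roughly 0.36
improved = chain_success(0.97, 20)  # roughly 0.54
```

Under these assumptions, a small gain in per-step accuracy produces a much larger gain in whole-task success, which is why catching individual bad actions early can pay off disproportionately.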

The signal

Watch whether production AI agents shipping in the next 12 months incorporate action-scoring models, and whether their measured error rates keep falling or plateau at new failure modes the reward model doesn't catch.