AI evaluators that say why an agent failed, not just pass or fail

What happened

A new evaluation method breaks GUI agent tasks into smaller, testable steps instead of judging an entire action sequence at once. Developers can see exactly where an agent goes wrong and why, rather than getting a binary pass-fail verdict on a 50-step task they can't debug.

Why it matters

Current AI evaluators treat a long task as one black box: the agent either completes it or doesn't, and the verdict says nothing about where it failed. That opacity makes the agent nearly impossible to improve, because you can't see which step broke or why. This method segments a task into logical subtasks and diagnoses each one separately, so developers get an actionable diagnostic report instead of a thumbs-up or thumbs-down. It also scales better as tasks get longer, because the evaluator judges one subtask at a time rather than choking on the context of the full sequence.
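
To make the idea concrete, here is a minimal Python sketch of per-subtask diagnosis. It is an illustration under stated assumptions, not the method's actual implementation: the names Subtask, SubtaskReport, requires, diagnose and first_failure are hypothetical, the segmentation is supplied by hand as index ranges, and each check is a simple rule where the real method may well use a learned judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# A check inspects one segment of actions and returns None on success,
# or a human-readable reason on failure. (Hypothetical interface.)
Check = Callable[[Sequence[str]], Optional[str]]


@dataclass
class Subtask:
    name: str
    start: int  # index of the first action in the segment (inclusive)
    end: int    # index just past the last action (exclusive)
    check: Check


@dataclass
class SubtaskReport:
    name: str
    passed: bool
    reason: Optional[str]


def requires(action: str) -> Check:
    """Build a toy check that passes only if `action` occurs in the segment."""
    def check(segment: Sequence[str]) -> Optional[str]:
        return None if action in segment else f"missing action {action!r}"
    return check


def diagnose(actions: Sequence[str], subtasks: List[Subtask]) -> List[SubtaskReport]:
    """Judge each subtask on its own slice of the action sequence."""
    reports = []
    for st in subtasks:
        reason = st.check(actions[st.start:st.end])
        reports.append(SubtaskReport(st.name, reason is None, reason))
    return reports


def first_failure(reports: List[SubtaskReport]) -> Optional[SubtaskReport]:
    """Return the earliest failing subtask, or None if every subtask passed."""
    return next((r for r in reports if not r.passed), None)


# Assumed example trajectory: the agent toggles the wrong setting.
actions = ["open_settings", "click_network", "toggle_bluetooth", "click_save"]
subtasks = [
    Subtask("open network settings", 0, 2, requires("click_network")),
    Subtask("enable wifi", 2, 3, requires("toggle_wifi")),
    Subtask("save changes", 3, 4, requires("click_save")),
]

reports = diagnose(actions, subtasks)
for r in reports:
    print(f"{r.name}: {'pass' if r.passed else 'FAIL (' + str(r.reason) + ')'}")
```

A single-verdict evaluator would report only that this run failed. The per-subtask report points at "enable wifi" and gives the reason, which is the kind of signal a developer can act on.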

The signal

Watch whether teams building GUI agents actually adopt this evaluation method, and whether agents trained with this diagnostic feedback improve faster than agents trained with existing single-verdict evaluators.