AI agents that perform well still fail basic safety checks

What happened

Researchers built a new way to test how AI agents behave in real-world tasks such as browsing the web or using a phone. The results show that even the best agents often fail basic safety checks while trying to complete those tasks.

Why it matters

Companies are building AI agents to handle complex tasks in the real world, but until now it has been unclear how safe those agents are. This new benchmark shows that top agents often make dangerous mistakes even when they appear to be doing their job well. That makes these behavioral risks much harder for developers to ignore.

The signal

Watch whether companies building AI agents start using this benchmark, and whether regulators begin to require similar safety testing before deployment.