The world is being quietly rearranged by people who write very long documents.


The title they went with: "BeSafe-Bench: Unveiling Behavioral Safety Risks of Situated Agents in Functional Environments". Noisy translates that to:

AI agents that perform well still fail basic safety checks


Researchers built a new benchmark to test how AI agents behave in real-world tasks like browsing the web or using a phone. Even the best agents often fail basic safety checks while trying to complete those tasks.
Companies are building AI agents to carry out complex tasks in the real world, but how safe those agents are has not been well understood. The benchmark shows that agents often make dangerous mistakes even when they appear to be doing their job well, so task success alone is not evidence of safe behavior and developers can no longer ignore these behavioral risks.
Watch whether companies building AI agents start using this benchmark, or if regulators begin to require similar safety testing before deployment.
