What happened
Foundation models such as GPT-5 and Gemini achieve high overall accuracy on navigation tasks, but still make dangerous or invalid decisions in the remaining cases. If an autonomous system navigates safely 93% of the time, the other 7% of decisions might include paths that end in a collision or ignored emergency protocols, and users have no way of knowing which decisions to distrust.
Why it matters
Aggregate performance scores hide critical failure modes: a 93% success rate can obscure systematic safety violations that would be unacceptable in any real deployment. Before any autonomous navigation system (delivery robots, warehouse logistics, emergency response) goes into production, you need to know not just how often it works, but in what specific ways it breaks.
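To make the point concrete, here is a minimal sketch of reporting failures by type alongside the headline score. The trial log, the failure labels, and the per-label counts are all invented for illustration; only the 93% figure comes from the text above, and `summarize` is a hypothetical helper, not part of any evaluation tool.

```python
from collections import Counter

# Hypothetical evaluation log: 100 navigation trials, one outcome label each.
# Labels and counts are made up for illustration only.
outcomes = (
    ["success"] * 93
    + ["collision_path"] * 4
    + ["ignored_emergency_protocol"] * 2
    + ["invalid_move"] * 1
)


def summarize(outcomes):
    """Return (aggregate accuracy, failure counts keyed by failure label)."""
    counts = Counter(outcomes)
    accuracy = counts["success"] / len(outcomes)
    failures = {label: n for label, n in counts.items() if label != "success"}
    return accuracy, failures


accuracy, failures = summarize(outcomes)
print(f"aggregate accuracy: {accuracy:.0%}")
# List failure modes from most to least frequent.
for label, n in sorted(failures.items(), key=lambda kv: -kv[1]):
    print(f"  {label}: {n} of {len(outcomes)} trials")
```

The single accuracy number is the same whether the seven failures are harmless detours or collisions; only the per-label breakdown tells the two situations apart.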