Vision-language AI models struggle to act precisely, even with perfect scene maps

What happened

Researchers tested whether giving vision-language models both an image and a symbolic map of a scene helps them play games better. They found that it helps only when the model can accurately extract that map itself. In practice, this suggests that models trained on internet images can describe what they see, but turning those descriptions into correct actions in interactive environments is held back by a reliability bottleneck that current models have not solved.

Why it matters

This is an honest documentation of failure: it shows a real gap in deployed vision-language AI that scaling has not yet closed. The bottleneck is not reasoning or language but perception reliability, which suggests that simply adding more data or parameters may not be enough to solve it. For anyone betting on AI agents that need to act precisely in the physical world, this signals that the problem is harder than current approaches assume.