The world is being quietly rearranged by people who write very long documents.


The title they went with Don't Blink: Evidence Collapse during Multimodal Reasoning Noisy translates that to

AI vision systems get smarter while looking less — and text-only safety checks can't catch it


Reasoning AI systems that process images can produce confident answers while gradually ignoring the visual information they're supposed to be grounded in — a failure mode that text-based safety monitoring cannot detect. This means an AI can seem to reason correctly while actually making decisions untethered from the image it's analyzing, especially on tasks that require sustained visual reference.
Deploying multimodal AI systems currently relies on monitoring text outputs and entropy signals to catch failures, but those signals miss a specific failure mode: the system stops looking at the image while maintaining confidence. When you can't see what the AI stopped paying attention to, you can't know when it's operating blind. This research shows that adding visual monitoring selectively (task-aware) catches these failures where pure text monitoring fails, which matters because it means deployment safety depends on understanding which tasks are actually vulnerable to visual disengagement.
Whether deployed multimodal AI systems start incorporating vision-based uncertainty signals alongside text signals, and whether that reduces errors on real-world tasks where visual grounding actually matters.

If you insist
Read the original →