What happened
Researchers found that when large vision-language models reason through a problem at length, they often stop attending to the image and reach wrong answers. Adding brief phrases that prompt the model to look back at the picture fixes this. The result points to a way of building AI systems that combine reasoning with visual grounding without spending tokens on long chains that lead nowhere.
Why it matters
For the first time, this work measures what actually happens inside a vision model's reasoning process. It shows that more thinking does not mean better thinking: the structure of that thinking, specifically whether it refers back to the image, determines success far more than its length.