Multimodal AI models fail on real-world tasks: a new benchmark measures whether they actually use their tools correctly

What happened

Researchers built a benchmark showing that multimodal AI models billed as able to use visual and search tools often use them ineffectively, or not at all. The benchmark includes 418 real-world problems with checkpoints that track every step of the model's reasoning, not just the final answer.
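To make the difference between grading the final answer and grading the process concrete, here is a minimal sketch in Python. It is not the researchers' scoring code: the Checkpoint structure, the function names, and the sample run are all hypothetical, and the real benchmark may weight or define its checkpoints differently.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One intermediate step a grader expects to see (hypothetical structure)."""
    description: str
    passed: bool


def final_answer_score(predicted: str, expected: str) -> float:
    """Outcome-only scoring: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def process_score(checkpoints: list[Checkpoint]) -> float:
    """Process-level scoring: fraction of intermediate checkpoints passed."""
    if not checkpoints:
        return 0.0
    return sum(c.passed for c in checkpoints) / len(checkpoints)


# Made-up run: the model lands on the right answer but never calls the search tool.
run = [
    Checkpoint("cropped the relevant region of the image", True),
    Checkpoint("issued a web search for the identified object", False),
    Checkpoint("used the retrieved source in its reasoning", False),
    Checkpoint("stated the final answer", True),
]

outcome = final_answer_score("Eiffel Tower", "eiffel tower")
process = process_score(run)
print(f"final-answer score: {outcome:.2f}, process score: {process:.2f}")
```

In this made-up run, an answer-only benchmark would record a full pass, while the checkpoint view shows that half of the expected steps never happened.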

Why it matters

Until now, AI benchmarks have mostly measured only whether models got the right answer at the end. This one watches the entire process, revealing that current models score 56% on easy problems but drop to 23% on hard ones. That suggests they are either not invoking tools when needed or invoking them incorrectly. The gap between claimed capability and actual execution is the story. If an AI model can't reliably use a search tool or read an image correctly on moderately complex tasks, then companies shipping 'agentic AI' are selling products that look like they work but don't.

The signal

Watch whether the next generation of multimodal models actually improves on Level-3 difficulty tasks, or whether the 23% ceiling holds.