The world is being quietly rearranged by people who write very long documents.


The title they went with: "Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?" Noisy translates that to:

AI vision tools fail on real-world tasks — benchmark now measures whether they actually use the tools correctly


Researchers built a test showing that multimodal AI models claiming to use visual and search tools often don't use them effectively or at all. The test includes 418 real-world problems with checkpoints tracking every step of the AI's reasoning, not just the final answer.
Until now, AI benchmarks measured only whether models got the right answer at the end. This one watches the entire process, revealing that current models score 56% on easy problems but drop to 23% on hard ones — which means they're either not invoking tools when needed or invoking them incorrectly. The gap between claimed capability and actual execution is the story. If an AI model can't reliably use a search tool or read an image correctly on moderately complex tasks, then companies shipping 'agentic AI' are selling boxes that look like they work but don't.
Watch whether the next generation of multimodal models actually improves on Level-3 difficulty tasks, or whether the 23% ceiling holds steady.
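
For readers who want the difference between grading the final answer and grading the process made concrete, here is a minimal sketch. It is not the benchmark's actual scoring code: the `Checkpoint` and `Attempt` classes, the two scoring functions, and the sample data are all hypothetical, and it assumes every task has at least one checkpoint.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One intermediate step a grader expects to see (hypothetical schema)."""
    description: str
    passed: bool


@dataclass
class Attempt:
    """A model's attempt at one task: its checkpoints plus the final verdict."""
    checkpoints: list[Checkpoint]
    final_answer_correct: bool


def outcome_score(attempts: list[Attempt]) -> float:
    """Fraction of tasks whose final answer is correct (outcome-only grading)."""
    return sum(a.final_answer_correct for a in attempts) / len(attempts)


def process_score(attempts: list[Attempt]) -> float:
    """Mean fraction of checkpoints passed per task (process-level grading)."""
    per_task = [
        sum(c.passed for c in a.checkpoints) / len(a.checkpoints)
        for a in attempts
    ]
    return sum(per_task) / len(per_task)


# Made-up data: both attempts end with the right answer, but the first
# one never used its tools correctly along the way.
attempts = [
    Attempt(
        checkpoints=[
            Checkpoint("cropped the relevant image region", False),
            Checkpoint("ran a search query for the missing fact", False),
        ],
        final_answer_correct=True,
    ),
    Attempt(
        checkpoints=[
            Checkpoint("cropped the relevant image region", True),
            Checkpoint("ran a search query for the missing fact", True),
        ],
        final_answer_correct=True,
    ),
]

print(outcome_score(attempts))  # outcome-only view
print(process_score(attempts))  # process-level view
```

On this toy data the outcome-only score looks perfect while the process score is only half, which is the kind of gap a step-by-step benchmark is meant to expose.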
