New video benchmark exposes limits of AI visual reasoning

What happened

Researchers created a dataset of 1,114 complex video questions that require AI systems to piece together evidence spread across time, not just analyze single moments, to answer correctly. The best current AI models score only 46%. Humans also struggle when they can't rewatch the video, which shows how demanding the task is and suggests that even the most advanced systems have fundamental gaps in understanding how things connect visually and temporally.

Why it matters

This reveals a genuine bottleneck in what AI can do. Most video benchmarks test pattern-matching on simple tasks, but real-world video understanding requires connecting distant pieces of visual information over time, which current AI systems cannot do reliably. This benchmark gives researchers a concrete map of where that gap lies.