What happened
Researchers built a test suite that measures how well AI language models handle structured logical reasoning at different levels of complexity. Current models fail badly at moderately difficult tasks and would need impractical amounts of computing power to become reliable at them. Even then, they would still be much slower than the traditional software tools designed for the same work.
Why it matters
Most teams building AI coding tools assume that bigger models and better training will eventually solve formal reasoning. This paper shows that assumption is backwards: the problem isn't missing capability, it's that language models are fundamentally inefficient at tasks that need step-by-step logical verification, and scaling up doesn't fix inefficiency. This means the boundary between where AI actually helps (writing prose, finding patterns) and where people still need traditional tools (compilers, constraint solvers, formal proof checkers) is probably permanent, not temporary.