Researchers build a test that breaks current AI planning — and it's just arithmetic

What happened

Computer scientists created a planning benchmark based on the game Countdown, in which players combine a set of input numbers using basic arithmetic to reach a target number. The benchmark exposes a fundamental weakness in how AI systems think ahead. Unlike existing tests, it has verifiable right answers, resists memorization, and consistently defeats current language models, meaning the gap between human and machine planning ability is wider than previous benchmarks suggested.

Why it matters

For years, AI researchers have measured planning ability with loose benchmarks, such as travel itineraries and game strategies, where it is hard to tell whether the AI actually planned or just pattern-matched. This benchmark is mathematically precise: you can prove whether an answer works. Current language models fail at something humans find trivial, which means confidence in AI planning abilities rests on easier tests than people realized. The implication is uncomfortable: if AI can't reliably solve simple arithmetic planning, claims about long-term reasoning in more complex domains deserve skepticism.
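
To make "you can prove whether an answer works" concrete, here is a minimal sketch of an answer checker in Python. It is an illustration, not the researchers' code. The function name verify_countdown, the step format (a, op, b), and the rules it enforces (each number used at most once, only + - * /, intermediate results must be positive whole numbers) are assumptions based on the standard Countdown game, not details confirmed by the benchmark.

```python
from collections import Counter


def verify_countdown(numbers, target, steps):
    """Return True if `steps` is a valid derivation of `target` from `numbers`.

    Hypothetical format: each step is a tuple (a, op, b).
    Assumed rules (standard Countdown, not confirmed for the benchmark):
    every available value may be used once, and every intermediate
    result must be a positive integer.
    """
    pool = Counter(numbers)  # values still available to use
    result = None
    for a, op, b in steps:
        # Both operands must be available: inputs or earlier results.
        need = Counter([a, b])
        if any(pool[value] < count for value, count in need.items()):
            return False
        pool -= need

        if op == "+":
            result = a + b
        elif op == "-":
            result = a - b
        elif op == "*":
            result = a * b
        elif op == "/":
            # Division must be exact.
            if b == 0 or a % b != 0:
                return False
            result = a // b
        else:
            return False

        if result <= 0:
            return False
        pool[result] += 1  # the result becomes usable in later steps

    # An empty list of steps leaves result as None, which never matches.
    return result == target


# Reaching 70 from 2, 5, 10, 25: 25 + 10 = 35, then 35 * 2 = 70.
print(verify_countdown([2, 5, 10, 25], 70, [(25, "+", 10), (35, "*", 2)]))
# A step that uses a number nobody produced is rejected.
print(verify_countdown([2, 5, 10, 25], 70, [(35, "*", 2)]))
```

Checking an answer takes a few lines and no judgment calls. Finding one means searching through many possible combinations of numbers and operations, which is the planning part.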

The signal

Watch whether labs working on AI planning start using Countdown as a standard benchmark over the next 12 months, or whether they avoid it because the results are embarrassing.