The world is being quietly rearranged by people who write very long documents.


The title they went with: "Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game". Noisy translates that to:

Researchers build a test that breaks current AI planning — and it's just arithmetic


Computer scientists created a planning benchmark based on the game Countdown (form a target number from inputs using basic math) that exposes a fundamental weakness in how AI systems think ahead. Unlike existing tests, this one has verifiable right answers, resists memorization, and consistently defeats current language models — meaning the gap between human and machine planning ability is wider than previous benchmarks suggested.
For years, AI researchers have measured planning ability with loose benchmarks (travel itineraries, game strategies) where it's hard to tell whether the AI actually planned or just pattern-matched. This benchmark is mathematically precise: any proposed answer can be checked mechanically. The result is that current language models fail at something humans find trivial, which means confidence in AI planning abilities rests on easier tests than people realized. The implication is uncomfortable: if AI can't reliably solve simple arithmetic planning, claims about long-term reasoning in more complex domains deserve skepticism.
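To make "verifiable" concrete, here is a minimal Python sketch of a Countdown checker and a brute-force solver. It is an illustration, not the researchers' code. The rules are assumptions borrowed from the usual form of the game and may differ from the benchmark's exact variant: inputs are positive integers, each is used at most once, the operations are + - * /, and every intermediate result must be a positive integer. The names `solve`, `check`, and `OPS` are made up for this example.

```python
from itertools import combinations

# Assumed rules: each operation returns None when the result would not be
# a positive integer (no zero or negative differences, no fractions).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b if a > b else None,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
}


def solve(numbers, target):
    """Exhaustive depth-first search.

    Returns a list of (a, op, b) steps that reaches target, or None.
    """
    if target in numbers:
        return []
    for i, j in combinations(range(len(numbers)), 2):
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        # Put the larger operand first so "-" and "/" are tried in the
        # only order that can yield a positive integer.
        hi, lo = max(numbers[i], numbers[j]), min(numbers[i], numbers[j])
        for op, fn in OPS.items():
            result = fn(hi, lo)
            if result is None:
                continue
            tail = solve(rest + [result], target)
            if tail is not None:
                return [(hi, op, lo)] + tail
    return None


def check(numbers, target, steps):
    """Verify a proposed solution step by step against the assumed rules."""
    pool = list(numbers)
    for a, op, b in steps:
        if op not in OPS:
            return False
        try:
            # Both operands must currently be available in the pool.
            pool.remove(a)
            pool.remove(b)
        except ValueError:
            return False
        result = OPS[op](a, b)
        if result is None:
            return False
        pool.append(result)
    return target in pool


if __name__ == "__main__":
    steps = solve([1, 2, 3], 7)
    print(steps, check([1, 2, 3], 7, steps))
```

Checking an answer takes one pass over its steps, while the search tries every pair and every operation at each level, so the work grows quickly as inputs are added. That asymmetry is what lets a benchmark like this be both easy to grade and hard to solve.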
Watch whether labs working on AI planning start using Countdown as a standard benchmark over the next 12 months, or whether they avoid it because the results are embarrassing.
