What happened
Researchers probed what happens inside language models when they solve spatial puzzles and found that the models don't build coherent mental maps the way humans do. Instead, they piece together answers from linguistic patterns. This shows that benchmark scores can mislead: a model can pass a spatial reasoning test without actually reasoning spatially, so we can't assume these systems will handle novel spatial problems or transfer what they learned to new contexts.
Why it matters
If language models pass spatial reasoning tests through linguistic shortcuts rather than genuine spatial understanding, then their performance in real-world spatial applications, such as navigation aids, robotics, and spatial planning, may collapse wherever those shortcuts don't work. It would also mean we've been measuring the wrong thing.