Researchers built a test to see if AI can plan multi-step trips. On the hard parts, it mostly can't

What happened

Researchers created a benchmark that tests whether large language models can plan complex real-world routes, such as finding an EV charging station that also has a restaurant nearby. The models can handle individual tasks but fail when they have to explore more distant options or balance competing goals. That gap matters for any application that depends on spatial reasoning.
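To make the charging-station example concrete, here is a minimal sketch of that kind of two-part question. It is not the benchmark's actual format or code: the function names, the 0.5 km walking radius, and the coordinates are all made up for illustration. The point is that the closest station can fail the second requirement, so the right answer sits farther away.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def stations_with_food(user, stations, restaurants, walk_km=0.5):
    """Keep stations that have a restaurant within walk_km, nearest to the user first."""
    hits = []
    for name, loc in stations.items():
        if any(haversine_km(loc, r) <= walk_km for r in restaurants.values()):
            hits.append((name, haversine_km(user, loc)))
    return sorted(hits, key=lambda pair: pair[1])

# Hypothetical data: the nearby station has no restaurant close to it,
# so the only valid answer is the station much farther from the user.
user = (0.0, 0.0)
stations = {"near": (0.0, 0.01), "far": (0.0, 0.2)}
restaurants = {"cafe": (0.0, 0.201)}

result = stations_with_food(user, stations, restaurants)
print(result)
```

A system that only checks what is closest would stop at the "near" station and never find a valid answer. That is the kind of failure the researchers describe.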

Why it matters

This is a measurement problem. Until now, there has been no standard way to test whether AI systems can actually reason about geography and competing constraints, the kinds of decisions humans make constantly when planning a trip or a delivery route. The benchmark itself (called EVGeoQA) is the signal: it names a capability gap that had been hard to see because nothing measured it. Once you can measure something, you can start improving it. The finding that AI models struggle with long-range spatial exploration while succeeding at immediate tasks points to where the bottleneck is.

The signal

Watch whether other research teams adopt this benchmark or build competing ones. Either would be a sign that this measurement standard is becoming real infrastructure, not just a paper.