The world is being quietly rearranged by people who write very long documents.


The title they went with: "SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios". Noisy translates that to:

AI coding agents fail at real software work — a benchmark shows the gap between lab tasks and actual engineering


Researchers built a test that measures whether AI can handle what software engineers actually do: modify dozens of files at once while keeping everything working. Current AI agents fail badly at this — they solve simple one-file tasks 73% of the time but only 25% of the time when the work spans multiple files and requires sustained reasoning across a codebase.
This is the first measurement showing that AI coding agents hit a wall when work becomes realistic. The gap between 25% and 73% is not a small performance dip — it is evidence that current agents cannot actually replace developers on the kind of work that matters: long tasks requiring coordination and memory across many files. This matters because the hype around AI coding tools is built almost entirely on benchmarks measuring isolated tasks. A tool that solves 73% of single bugs looks impressive in a press release. A tool that solves 25% of actual multi-step engineering problems does not. Watch how vendors respond: they will either admit the gap and show a path to closing it, or they will keep marketing against the old benchmarks and let this one sit ignored.
Watch whether vendors begin marketing their agents against SWE-EVO scores instead of continuing to cite their SWE-Bench results, or whether they stop citing benchmark scores altogether as the focus shifts to real deployment.
