The world is being quietly rearranged by people who write very long documents.


The title they went with
SWE Context Bench: A Benchmark for Context Learning in Coding

Noisy translates that to

New benchmark measures whether AI coding agents can learn from past problems


Researchers created a test that measures whether large language models used for coding can reuse what they learned from one problem to solve similar problems faster and cheaper, something previous benchmarks didn't actually measure. This matters because if AI agents can't reuse experience efficiently, they'll be much slower and more expensive to use for real software engineering work — and if they can, that changes the economics of automated coding.
Until now, benchmarks tested whether AI coding agents could solve individual programming tasks but ignored whether they could learn from context and apply it to related problems, the way a human programmer would. This benchmark shows that agents can save significant time and cost when given the right summarized context, while randomly added context actually hurts performance. That suggests the real bottleneck isn't the model's capability but the ability to select and represent relevant experience.
