The world is being quietly rearranged by people who write very long documents.


The title they went with: "Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution". Noisy translates that to:

AI code assistants inflate their success rates by ignoring past mistakes


Researchers have created a new way to test AI that writes computer code. It shows that current tests make AI look much better than it is, because the new tests track how code changes over time rather than scoring single fixes in isolation.
Current tests for AI code assistants only look at fixing one problem at a time. They do not account for how fixing one thing can break another, or how code gets messier over time. This means AI assistants appear to be much better at their jobs than they actually are in real-world software development. The new tests show that AI-written code degrades repository health more than code written by human developers, creating more technical debt.
Watch whether AI coding assistants start to be evaluated on their ability to maintain code quality over many changes, not just single fixes.
