The world is being quietly rearranged by people who write very long documents.


The title they went with: A Time-Consistent Benchmark for Repository-Level Software Engineering Evaluation

Noisy translates that to:

New benchmark catches AI coding assistants cheating on test data


Researchers built a method to fairly test AI software engineering tools by taking a snapshot of a code repository at one point in time, then evaluating whether the AI can handle future coding tasks without access to information it shouldn't have. This matters because previous benchmarks were flawed — AI systems were accidentally being tested on information they'd already seen during training, making them look better than they actually perform on genuinely new work.
For years, benchmarks measuring AI coding assistants have been vulnerable to temporal contamination: the AI gets tested on code changes made before its training data cutoff, so it may be graded on material it has already seen. This benchmark is designed to rule that out by establishing a hard temporal boundary: snapshot the repository, build the AI's knowledge from pre-snapshot artifacts only, then evaluate on actual pull requests merged after the snapshot. It is billed as the first methodology that controls for whether the AI is genuinely solving new problems or just pattern-matching on training data.
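The temporal boundary described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's implementation: the `Artifact` and `PullRequest` types, the `split_at_snapshot` helper, and the sample data are all hypothetical names invented here, and treating an artifact dated exactly at the snapshot as "pre-snapshot" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Artifact:
    """Hypothetical dated repository item (commit, issue, doc page)."""
    name: str
    created_at: datetime


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical merged pull request used as an evaluation task."""
    number: int
    merged_at: datetime


def split_at_snapshot(artifacts, pull_requests, snapshot):
    """Separate knowledge sources from evaluation tasks at a hard time boundary.

    Assumption: artifacts dated at or before the snapshot count as knowledge;
    only pull requests merged strictly after it count as tasks.
    """
    knowledge = [a for a in artifacts if a.created_at <= snapshot]
    tasks = [p for p in pull_requests if p.merged_at > snapshot]
    return knowledge, tasks


def utc(year, month, day):
    """Small helper for timezone-aware sample dates."""
    return datetime(year, month, day, tzinfo=timezone.utc)


# Made-up sample data, for illustration only.
snapshot = utc(2024, 6, 1)
artifacts = [
    Artifact("commit-a", utc(2024, 5, 20)),
    Artifact("commit-b", utc(2024, 6, 15)),  # after the snapshot: excluded
]
pull_requests = [
    PullRequest(1, utc(2024, 5, 25)),  # merged before the snapshot: excluded
    PullRequest(2, utc(2024, 6, 10)),
]

knowledge, tasks = split_at_snapshot(artifacts, pull_requests, snapshot)

# Leakage check: every knowledge item must predate every evaluation task.
no_leakage = all(
    a.created_at < p.merged_at for a in knowledge for p in tasks
)
```

The point of the split is that the two filters use the same cutoff in opposite directions, so nothing the AI is allowed to read can postdate anything it is graded on.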
