New benchmark catches AI coding assistants cheating on test data

What happened

Researchers built a method to fairly test AI software engineering tools: take a snapshot of a code repository at a fixed point in time, then evaluate whether the AI can handle coding tasks that arrived after the snapshot, without access to information it shouldn't have. This matters because previous benchmarks were flawed: AI systems could be tested on material they had already seen during training, which made them look better than they actually perform on genuinely new work.

Why it matters

For years, benchmarks measuring AI coding assistants have been corrupted by temporal contamination: the AI gets tested on code changes that were made before its training data cutoff, so it may be reproducing answers it has already seen rather than solving the task. This benchmark is designed to rule that out by establishing a hard temporal boundary: snapshot the repository, build the AI's knowledge from pre-snapshot artifacts only, then evaluate on actual pull requests merged after the snapshot. It's the first methodology that controls for whether the AI is genuinely solving new problems or just pattern-matching on training data.
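
The temporal boundary described above can be illustrated with a minimal Python sketch. This is not the researchers' implementation: the `Artifact` record, the `temporal_split` helper, and the sample identifiers and dates are all hypothetical, chosen only to show how a snapshot timestamp separates what the AI may know from what it is evaluated on.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record for a repository artifact (commit, issue, merged PR)."""
    ident: str
    kind: str            # e.g. "commit", "issue", "pull_request"
    timestamp: datetime  # creation time, or merge time for pull requests


def temporal_split(artifacts, snapshot):
    """Split artifacts at the snapshot time.

    Returns (knowledge, eval_tasks):
      knowledge  -- every artifact timestamped at or before the snapshot
      eval_tasks -- pull requests merged strictly after the snapshot
    Post-snapshot artifacts that are not pull requests land in neither
    list, so they cannot leak into the AI's knowledge.
    """
    knowledge = [a for a in artifacts if a.timestamp <= snapshot]
    eval_tasks = [
        a for a in artifacts
        if a.kind == "pull_request" and a.timestamp > snapshot
    ]
    return knowledge, eval_tasks


# Illustrative data only; these identifiers and dates are made up.
snapshot = datetime(2024, 1, 1, tzinfo=timezone.utc)
artifacts = [
    Artifact("commit-a", "commit", datetime(2023, 11, 3, tzinfo=timezone.utc)),
    Artifact("pr-1", "pull_request", datetime(2023, 12, 20, tzinfo=timezone.utc)),
    Artifact("issue-9", "issue", datetime(2024, 1, 15, tzinfo=timezone.utc)),
    Artifact("pr-2", "pull_request", datetime(2024, 2, 2, tzinfo=timezone.utc)),
]

knowledge, eval_tasks = temporal_split(artifacts, snapshot)
print([a.ident for a in knowledge])   # ['commit-a', 'pr-1']
print([a.ident for a in eval_tasks])  # ['pr-2']
```

In this sketch the post-snapshot issue is deliberately dropped from both lists: it is not an evaluation task, and including it in the knowledge set would give the AI information from after the boundary.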