What happened
Researchers developed a fast, automated method for spotting bad questions in large AI benchmarks and standardized tests: questions that are confusing, wrong, or don't test what they claim to test. AI systems are now evaluated on thousands of questions, and nobody has time to hand-check each one for quality, so bad questions silently corrupt the measurements.
Why it matters
When you measure AI performance on a benchmark full of broken questions, you're not measuring the AI; you're measuring the benchmark. This method is the first practical way to catch those broken items at scale without hiring psychometricians to read thousands of test questions by hand.