Medical AI benchmarks measure the wrong thing — a new method exposes why rankings keep changing
What happened
Researchers tested 71 medical AI models and found that standard accuracy scores hide what the models actually know. They applied a measurement method borrowed from psychological testing that separates real competency from lucky guesses, and it predicts how models will perform on completely new medical questions far better than accuracy-based rankings did.
Why it matters
Every major AI system deployed in medicine gets evaluated using accuracy scores on benchmark tests, the same logic as grading a student. But accuracy weights every question equally, so a model can look competent by getting the easy questions right while missing the hard ones, even though both matter. This work shows that AI models have wildly uneven competency across medical topics: one model might be solid on cardiology but dangerous on psychiatry, and aggregate scores hide that. The gap between what the ranking says and what the model actually does matters in any field where some failures are catastrophic. The method itself isn't new (it comes from psychology testing in the 1970s), but applying it to AI evaluation is, and it holds up across multiple independent test sets, not just in the lab.
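To make the contrast concrete, here is a minimal Python sketch. It assumes the psychometric method behaves like a two-parameter item response theory (IRT) model, which is a standard tool in psychological testing; the brief does not name the specific model the researchers used. The item difficulties, discriminations, and answer patterns are invented for illustration and are not data from the study.

```python
import math

# Hypothetical item parameters (NOT from the study): (difficulty b, discrimination a).
ITEMS = [
    (-1.5, 0.5), (-1.0, 0.5), (-0.5, 0.5),  # easy questions, weakly discriminating
    (0.5, 2.0), (1.0, 2.0), (1.5, 2.0),     # hard questions, strongly discriminating
]

def accuracy(responses):
    """Plain accuracy: every question counts the same."""
    return sum(responses) / len(responses)

def p_correct(theta, b, a):
    """2PL IRT: probability that a test-taker with ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses):
    """Log-likelihood of an answer pattern (1 = correct, 0 = wrong) at ability theta."""
    total = 0.0
    for x, (b, a) in zip(responses, ITEMS):
        p = p_correct(theta, b, a)
        total += math.log(p) if x else math.log(1.0 - p)
    return total

def estimate_ability(responses):
    """Maximum-likelihood ability estimate via a simple grid search on [-4, 4]."""
    grid = [i / 100 for i in range(-400, 401)]
    return max(grid, key=lambda theta: log_likelihood(theta, responses))

# Two hypothetical models with identical accuracy but opposite answer patterns.
model_a = [1, 1, 1, 0, 0, 0]  # right on the easy items only
model_b = [0, 0, 0, 1, 1, 1]  # right on the hard items only

for name, responses in [("A", model_a), ("B", model_b)]:
    print(name, "accuracy:", accuracy(responses),
          "estimated ability:", estimate_ability(responses))
```

Both models score 0.5 on accuracy, yet the ability estimate for model B comes out higher because its correct answers fall on harder, more informative questions. A real evaluation would also have to estimate the item parameters from many models' responses rather than assume them as this sketch does.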
The signal
Watch whether hospitals or regulators adopt this psychometric evaluation method the next time they benchmark medical AI systems, or keep using accuracy scores because they are simpler to explain to leadership.