AI tools for doctors can't grade their own answers; humans must step in

What happened

A new study finds that more than half of the automatically generated reference answers used to evaluate AI tools for doctors are incorrect. The finding implies that developers of clinical AI systems need to include human doctors in their testing process to ensure accuracy.

Why it matters

Researchers have relied on automated systems to check how well AI tools perform in clinical settings. This paper shows that those automated checks are deeply flawed: they miss critical nuances such as negation or to whom a symptom applies. As a result, developing truly reliable clinical AI will be slower and more expensive, because human medical experts must be involved at every step of the evaluation.

The signal

Watch whether major clinical AI benchmarks start requiring physician adjudication for their reference answers, or whether new benchmarks emerge that explicitly incorporate that human review.