The world is being quietly rearranged by people who write very long documents.


The title they went with:
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

Noisy translates that to:

AI tools for doctors can't grade their own answers; humans must step in


A new study finds that more than half of the automatically generated reference answers used to evaluate AI tools for doctors are incorrect. This means developers of clinical AI systems must include human doctors in their testing process to ensure accuracy.
Researchers have relied on automated systems to check how well AI tools perform in clinical settings. This paper shows that these automated checks are deeply flawed, missing critical nuances like negation or who a symptom applies to. It means that developing truly reliable clinical AI will be slower and more expensive, as human medical experts must be involved in every step of the evaluation.
Watch whether major clinical AI benchmarks start requiring physician adjudication for their reference answers, or whether new benchmarks emerge that explicitly incorporate this human review.
