What happened
Researchers created a benchmark that tests medical AI systems the way doctors are tested: through simulated conversations in which the AI must gather a patient's history, analyze medical images and lab reports, and make a diagnosis. This is a shift away from traditional medical benchmarks, which ask AI systems to answer multiple-choice questions, toward evaluating whether AI can reason through the messy reality of clinical work.
Why it matters
Medical AI benchmarks have so far measured whether systems can answer exam questions correctly, but that tells you almost nothing about whether an AI could help a doctor in a real clinic. This benchmark simulates what doctors do in practice (ask questions, read results, think through possibilities), so it can measure something closer to real clinical competence than to test-taking ability.