Medical AI confidently gives wrong answers — and a cheap fix actually works

What happened

Researchers tested three popular medical AI models across multiple sizes and found they all suffer from the same problem: they're overconfident even when they're wrong, and no amount of prompting or scaling fixes it. But a simple statistical adjustment applied after the model finishes its prediction — similar to techniques used in spam filters for decades — cuts the overconfidence in half and actually improves accuracy on harder questions.
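To make the "adjustment after the prediction" idea concrete, here is a minimal sketch. The brief does not name the exact correction the researchers used, so this assumes temperature scaling, one common post-hoc method and a close relative of the Platt scaling long used with classifiers. The function names and the held-out data are hypothetical, chosen only for illustration.

```python
import math

def logit(p):
    # Inverse of the sigmoid; p must be strictly between 0 and 1.
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rescale(conf, temperature):
    # Temperature > 1 pulls confidence toward 0.5; < 1 pushes it away.
    return sigmoid(logit(conf) / temperature)

def neg_log_likelihood(confs, correct, temperature):
    # Average log loss of the rescaled confidences against the outcomes.
    total = 0.0
    for c, y in zip(confs, correct):
        p = rescale(c, temperature)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(confs)

def fit_temperature(confs, correct):
    # Simple grid search over temperatures 0.5 .. 5.0 on held-out answers.
    grid = [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: neg_log_likelihood(confs, correct, t))

# Hypothetical held-out set: the model claimed 95% on every answer,
# but only 7 of 10 were right.
held_out_confs = [0.95] * 10
held_out_correct = [True] * 7 + [False] * 3

temperature = fit_temperature(held_out_confs, held_out_correct)
adjusted = rescale(0.95, temperature)
print(f"temperature={temperature:.2f}, adjusted confidence={adjusted:.2f}")
```

On this toy data the fitted temperature pulls a claimed 95% down to roughly 70%, matching how often the answers were actually right. The model itself is never retrained; only its reported confidence changes.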

Why it matters

Medical AI is moving into clinical decision support, where a doctor relies on the model's confidence score to decide whether to act on its answer. If the model says it's 95% sure but is actually right only 70% of the time, doctors will trust wrong answers. The researchers found this miscalibration persists across model families and model sizes, which means it's not a problem that scales away with bigger models. The practical fix they tested, applying a statistical correction after the model runs, is a boring, established technique that works and costs almost nothing to deploy. Expect hospitals and vendors to adopt this before rolling out medical AI in any serious way.
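The 95%-versus-70% gap above can be measured directly. The sketch below computes expected calibration error (ECE), a standard summary of how far stated confidence sits from observed accuracy. The brief does not say which calibration metric the researchers reported, so treat ECE as an assumption here. The function name and sample data are hypothetical.

```python
def expected_calibration_error(confs, correct, n_bins=10):
    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # conf of 1.0 goes in the last bin
        bins[idx].append((c, y))

    total = len(confs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, y in b if y) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of answers.
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical example: 20 answers, all claimed at 95%, 14 actually correct.
sample_confs = [0.95] * 20
sample_correct = [True] * 14 + [False] * 6

ece = expected_calibration_error(sample_confs, sample_correct)
print(f"ECE = {ece:.2f}")
```

A perfectly calibrated model scores 0. In this toy case the score is 0.25, the same 25-point gap between claimed and actual reliability described in the paragraph above.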

The signal

Watch whether the next generation of FDA-cleared medical AI systems disclose their calibration scores alongside accuracy, or whether vendors bury this data because it makes their models look worse than raw accuracy alone.