What happened
Researchers found that standard AI evaluation methods, which measure how confident a model is in each word choice, can mask serious failures when models have to write answers from scratch. A smaller, stripped-down AI model looked nearly as good as its larger teacher on the standard test but did 20% worse when it had to generate actual responses, meaning the standard test gave a misleading picture of real performance.
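To see how the two kinds of test can diverge, here is a minimal toy sketch. It is not the researchers' setup: the "model" is a hypothetical lookup table (STUDENT), the reference answer (REFERENCE) is made up, and simple per-token accuracy stands in for a confidence-based score. The point is only that a single per-word mistake barely moves a score computed with the correct previous words supplied, but can derail everything after it when the model has to continue from its own output.

```python
# Toy illustration only: STUDENT and REFERENCE are hypothetical, not real data.

# The answer the model is supposed to produce, one token at a time.
REFERENCE = ["a", "b", "c", "d", "e", "f"]

# Hypothetical student model: maps the previous token to its predicted next token.
# It is wrong in exactly one context ("b" -> "x" instead of "c").
STUDENT = {"<s>": "a", "a": "b", "b": "x", "c": "d", "d": "e", "e": "f", "x": "x"}


def teacher_forced_accuracy(model, reference):
    """Score each prediction using the correct previous token as context."""
    prev_tokens = ["<s>"] + reference[:-1]
    correct = sum(model[prev] == target for prev, target in zip(prev_tokens, reference))
    return correct / len(reference)


def free_running_accuracy(model, reference):
    """Score each prediction using the model's own previous output as context."""
    prev, correct = "<s>", 0
    for target in reference:
        prev = model[prev]  # the model continues from whatever it just wrote
        correct += prev == target
    return correct / len(reference)


if __name__ == "__main__":
    print(teacher_forced_accuracy(STUDENT, REFERENCE))  # 5 of 6 tokens correct
    print(free_running_accuracy(STUDENT, REFERENCE))    # 2 of 6 tokens correct
```

In this toy case the same single mistake costs one token out of six under the standard-style test, but four out of six once the model generates on its own, because every later step starts from the wrong word.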
Why it matters
If AI models are compressed and deployed based on test scores that don't measure what matters in production, systems built on them will fail in ways nobody expected. The field is optimizing toward metrics that turn out to be poor proxies for real-world performance. That is a structural problem: it shapes which models get deployed, and it makes some approaches look promising when they are actually inferior.