AI evaluation benchmarks have been hiding their biggest problems, and researchers want to fix that by publishing item-level test data
What happened
The evaluations used to decide whether AI systems are safe enough for hospitals and courts rest on untested assumptions and hidden flaws. Researchers argue that publishing the item-level data from benchmark tests would expose which specific questions AI systems fail on, making it possible to diagnose real problems instead of trusting aggregate scores.
Why it matters
Right now, when a company says its AI system scored 94% on a safety benchmark, nobody knows which questions it got wrong or why, only that the aggregate number passed. That's the equivalent of approving a drug based only on the average outcome across all patients, without looking at which patients got worse. The researchers' paper argues that without the granular test data, you can't tell whether a benchmark measures what it claims to measure or whether a system is simply gaming the number. If item-level data is published alongside aggregate scores, the first institutions to demand that transparency will actually know what their AI systems can't do. Everyone else will keep guessing.
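To make the aggregate-versus-item-level point concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the paper: the item IDs, the two categories ("general" and "medication"), the system names, and the pass/fail results are all made up. It shows two hypothetical systems with the same overall score whose failures look very different once the per-question results are visible.

```python
from collections import defaultdict

# Hypothetical benchmark: 10 items, each tagged with a made-up category.
ITEMS = [(f"q{i}", "general") for i in range(1, 7)] + \
        [(f"q{i}", "medication") for i in range(7, 11)]


def make_results(failed_ids):
    """Build item-level results as (item_id, category, passed) tuples."""
    return [(item_id, cat, item_id not in failed_ids) for item_id, cat in ITEMS]


# Invented outcomes for two hypothetical systems; each fails two items.
results = {
    "system_a": make_results({"q2", "q9"}),
    "system_b": make_results({"q8", "q9"}),
}


def aggregate_score(items):
    """Fraction of items passed: the single number usually reported."""
    return sum(passed for _, _, passed in items) / len(items)


def failures_by_category(items):
    """Group failed item IDs by category: visible only with item-level data."""
    failed = defaultdict(list)
    for item_id, category, passed in items:
        if not passed:
            failed[category].append(item_id)
    return dict(failed)


for name, items in results.items():
    print(name, aggregate_score(items), failures_by_category(items))
```

Both systems report the same aggregate score, but in this invented data one system's failures are spread across categories while the other's are concentrated in a single category. A reader who sees only the headline number cannot tell the two apart.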
The signal
Watch whether major AI vendors start publishing item-level benchmark results in the next 18 months, or whether they resist and argue the granular data is proprietary.