The world is being quietly rearranged by people who write very long documents.


The title they went with:

Position: Science of AI Evaluation Requires Item-level Benchmark Data

Noisy translates that to:

AI evaluation benchmarks have been hiding their biggest problems, and researchers want to fix that by publishing the test data


AI evaluations used to decide if systems are safe enough for hospitals and courts are built on untested assumptions and hidden flaws. Researchers are arguing that publishing the item-level data from benchmark tests would expose which specific questions AI systems actually fail on, making it possible to diagnose real problems instead of trusting aggregate scores.
Right now, when a company says its AI system scored 94% on a safety benchmark, nobody knows which questions it got wrong or why, only that some aggregate number passed. That's the equivalent of approving a drug based only on the average outcome across all patients, without looking at which patients got worse. This paper argues that without the item-level test data, you can't tell whether a benchmark is actually measuring what it claims to measure or whether the headline number is simply being gamed. If item-level data gets published, the first institutions to demand this transparency will actually know what their AI systems can't do. Everyone else will keep guessing.
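To make the aggregate-versus-item-level point concrete, here is a minimal Python sketch. It is not taken from the paper: the item ids, categories, pass/fail outcomes, and both helper functions are invented for illustration. It shows two hypothetical systems with identical aggregate scores that fail on entirely different kinds of questions.

```python
from collections import defaultdict

# Hypothetical item-level results: item id -> (category, passed).
# Every id, category, and outcome here is made up for illustration.
model_a = {
    "q1": ("dosage", True),
    "q2": ("dosage", True),
    "q3": ("triage", True),
    "q4": ("triage", False),
}
model_b = {
    "q1": ("dosage", False),
    "q2": ("dosage", True),
    "q3": ("triage", True),
    "q4": ("triage", True),
}


def aggregate_score(results):
    """Fraction of items passed: the single number that usually gets reported."""
    return sum(passed for _, passed in results.values()) / len(results)


def failures_by_category(results):
    """Group the ids of failed items by category (needs item-level data)."""
    failed = defaultdict(list)
    for item_id, (category, passed) in results.items():
        if not passed:
            failed[category].append(item_id)
    return dict(failed)


for name, results in (("A", model_a), ("B", model_b)):
    print(name, aggregate_score(results), failures_by_category(results))
```

Both systems score 0.75 in aggregate, but one misses a triage question and the other a dosage question. That distinction is only visible when the per-item results are available.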
Watch whether major AI vendors start publishing item-level benchmark results in the next 18 months, or whether they resist and argue the granular data is proprietary.
