The world is being quietly rearranged by people who write very long documents.


The title they went with:
Robust LLM Performance Certification via Constrained Maximum Likelihood Estimation

Noisy translates that to:

How to actually measure when AI systems fail — without paying humans to check everything


Researchers found a way to estimate AI failure rates using a mix of human-checked examples, AI self-grading, and domain knowledge constraints — cutting the cost of safe deployment. Instead of choosing between expensive human review and unreliable AI self-assessment, this method uses all three sources together to get accurate failure estimates at scale.
Right now, deploying a language model safely means either paying humans to laboriously check outputs or trusting the AI to grade itself, which often produces unreliable results. This paper shows a concrete way to combine the two: use a small set of human-verified examples to calibrate the AI's self-grading, then apply what you know about the AI's actual blind spots to correct for systematic bias. That means the expensive human-review bottleneck might be solvable without accepting lower quality.
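To make the idea concrete, here is a minimal sketch of a constrained maximum likelihood estimate. It is not the paper's actual formulation, which this summary does not spell out. It assumes a simple AI judge that flags real failures at a fixed rate (`tpr`) and raises false alarms at a fixed rate (`fpr`), with domain knowledge supplied as bounds on those two rates. Every name and every number below is hypothetical and synthetic.

```python
import math
from itertools import product


def log_lik(theta, tpr, fpr, labeled, unlabeled):
    """Joint log-likelihood of human-checked and judge-only data.

    theta: true failure rate; tpr/fpr: assumed judge hit and false-alarm rates.
    labeled:   counts (n11, n10, n01, n00) keyed by (human says fail, judge says fail).
    unlabeled: counts (m1, m0) of judge-flagged and judge-passed outputs.
    """
    n11, n10, n01, n00 = labeled
    m1, m0 = unlabeled
    q = theta * tpr + (1 - theta) * fpr  # chance the judge flags any given output
    return (
        n11 * math.log(theta * tpr)
        + n10 * math.log(theta * (1 - tpr))
        + n01 * math.log((1 - theta) * fpr)
        + n00 * math.log((1 - theta) * (1 - fpr))
        + m1 * math.log(q)
        + m0 * math.log(1 - q)
    )


def grid(lo, hi, steps):
    """Evenly spaced points from lo to hi, inclusive."""
    return [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]


def constrained_mle(labeled, unlabeled, tpr_bounds, fpr_bounds, steps=21):
    """Grid-search the likelihood, restricting tpr and fpr to the given bounds.

    Bounds must lie strictly inside (0, 1) so every logarithm is defined.
    """
    thetas = grid(0.005, 0.995, 199)
    tprs = grid(tpr_bounds[0], tpr_bounds[1], steps)
    fprs = grid(fpr_bounds[0], fpr_bounds[1], steps)
    return max(
        product(thetas, tprs, fprs),
        key=lambda p: log_lik(*p, labeled, unlabeled),
    )


# Synthetic data: 100 human-checked outputs, 10,000 judged by the AI alone.
labeled = (8, 2, 9, 81)
unlabeled = (1700, 8300)

theta_hat, tpr_hat, fpr_hat = constrained_mle(
    labeled,
    unlabeled,
    tpr_bounds=(0.7, 0.9),    # assumed domain knowledge about the judge
    fpr_bounds=(0.05, 0.15),  # assumed domain knowledge about the judge
)
naive_rate = unlabeled[0] / sum(unlabeled)  # trusting the judge's flags as-is
print(f"judge-only failure rate: {naive_rate:.3f}")
print(f"constrained MLE failure rate: {theta_hat:.3f}")
```

In this toy setup the judge flags 17% of outputs, but its flags include false alarms, so the raw flag rate is a biased estimate of the failure rate. The small human-checked set and the bounds on the judge's error rates are what let the estimate correct for that bias. A real implementation would likely use a numerical optimizer instead of a grid and would report uncertainty alongside the point estimate.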
Track whether companies actually adopt this method in production systems within 12 months, or whether deployment decisions still default to expensive human review or skip failure-rate estimation entirely.
