The world is being quietly rearranged by people who write very long documents.


The title they went with Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation Noisy translates that to

How many AI judges do you need to find all the problems? Answer: twice as many as you thought.


Researchers tested whether AI models can reliably evaluate other AI conversations the same way humans would, and discovered that quality ratings plateau quickly while finding rare edge-case bugs requires substantially larger panels. This means companies using AI to audit their own systems may think they're catching problems when they're actually missing the weird, corner-case failures that happen in production.
For the first time, someone measured the actual scaling curve: small panels of AI judges agree with humans and feel efficient, but they systematically miss rare failure modes. That creates a false confidence problem — your quality score looks good at five judges, but you need ten to catch the bugs that matter. As companies increasingly replace human review teams with AI panels, this gap between what looks good and what actually works becomes a real reliability risk.
Whether companies building AI-based evaluation systems adopt larger panel sizes in practice, or whether the cost pressure to use small panels wins out and rare failures start accumulating in production systems.

If you insist
Read the original →