How many AI judges do you need to find all the problems? Answer: twice as many as you thought.

What happened

Researchers tested whether AI models can reliably evaluate AI conversations the same way humans would. They found that quality ratings plateau quickly as judges are added, while finding rare edge-case bugs requires substantially larger panels. Companies that use AI to audit their own systems may therefore think they are catching problems while actually missing the unusual corner-case failures that surface in production.

Why it matters

For the first time, researchers measured the actual scaling curve: small panels of AI judges agree with humans and feel efficient, but they systematically miss rare failure modes. The result is false confidence: your quality score looks good at five judges, but you need ten to catch the bugs that matter. As companies increasingly replace human review teams with AI panels, this gap between what looks good and what actually works becomes a real reliability risk.
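To see why coverage of rare failures can keep climbing after a quality score has settled, here is a minimal back-of-the-envelope sketch. It is not the researchers' method. It assumes each judge flags a given rare failure independently with the same probability, and the names `detection_probability`, `panel_size_needed`, and `per_judge_rate` are hypothetical, chosen only for illustration.

```python
def detection_probability(per_judge_rate: float, panel_size: int) -> float:
    """Chance that at least one judge in the panel flags a given failure.

    ASSUMPTION: judges flag the failure independently, each with the same
    probability `per_judge_rate`. Judges that share training data or a base
    model are likely correlated, which would make this estimate optimistic.
    """
    if not 0.0 <= per_judge_rate <= 1.0:
        raise ValueError("per_judge_rate must be between 0 and 1")
    if panel_size < 1:
        raise ValueError("panel_size must be at least 1")
    # The failure is missed only if every judge misses it.
    return 1.0 - (1.0 - per_judge_rate) ** panel_size


def panel_size_needed(per_judge_rate: float, target: float) -> int:
    """Smallest panel whose detection probability reaches `target`."""
    if not 0.0 < per_judge_rate <= 1.0:
        raise ValueError("per_judge_rate must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    size = 1
    while detection_probability(per_judge_rate, size) < target:
        size += 1
    return size


if __name__ == "__main__":
    # Illustrative rate only, not a figure from the study.
    rate = 0.1
    for size in (1, 5, 10, 20):
        print(size, round(detection_probability(rate, size), 3))
```

Under these assumptions, the rarer the failure, the more judges it takes to reach the same chance of catching it, even though an averaged quality score would change little once a few judges are in place.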

The signal

Watch whether companies building AI-based evaluation systems adopt larger panels in practice, or whether cost pressure to keep panels small wins out and rare failures start accumulating in production systems.