Using AI to judge AI outputs doesn't work reliably — and now we know mathematically why

What happened

Researchers found that using large language models to score other large language models produces unreliable rankings, especially when the judge rates outputs on a scale with more than two levels. In practice, companies that use AI to evaluate AI performance may be placing false confidence in their results, and may be shipping flawed systems.

Why it matters

For years, AI labs have treated LLM judges as a cheap way to evaluate other LLMs: no human review needed, just feed the outputs through another model and trust the scores. In doing so they have skipped a basic uncertainty question: how confident are we that the judge itself is any good? The research shows that the answer matters. When the judge only has to pick between two options (better or worse), LLM judges are reasonably reliable. But many real benchmarks use 5-point or 7-point scales, where judge scores become unreliable unless the judge's own quality is modeled explicitly. This means existing AI leaderboards and model comparison studies may have rankings that are reversed or skewed by judge error, rather than reflecting only differences between the models being judged.

The signal

Watch whether AI labs start publishing uncertainty bands alongside their benchmark scores, or whether they continue releasing single-point rankings without acknowledging judge reliability.
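
For readers wondering what an uncertainty band might look like in practice, here is a minimal sketch using a percentile bootstrap over one model's judge scores. This is a generic statistical technique, not the method from the research described above, and it captures only sampling noise, not the judge's own error. The function name `bootstrap_band` and the scores are made up for illustration.

```python
import random
import statistics

def bootstrap_band(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a list of judge scores.

    Hypothetical helper for illustration; it reflects sampling noise only
    and does not model how reliable the judge itself is.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi

# Made-up 5-point judge scores for one hypothetical model.
scores = [4, 5, 3, 4, 2, 5, 4, 3, 4, 5, 1, 4]
mean, lo, hi = bootstrap_band(scores)
print(f"mean={mean:.2f}, 95% band=[{lo:.2f}, {hi:.2f}]")
```

If two models' bands overlap heavily, a leaderboard that ranks them by the single mean score is claiming more than the data supports, even before judge error is taken into account.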