LLMs now evaluate each other — and mostly agree with humans

What happened

Researchers built a system in which large language models grade each other's work using game-theoretic methods, then compared those grades with human judgment. The models mostly agree with humans on quality, but there are meaningful gaps in what each side notices and values.

Why it matters

For years, AI labs have evaluated their models on fixed test sets with single correct answers: SAT-style benchmarks that miss how people actually use these tools. This paper shows that the models themselves can serve as evaluators. That matters because model evaluators might catch failures that human graders miss, and they could be cheaper to scale than hiring people to judge thousands of model outputs. The catch: models and humans don't agree on everything, so leaning too hard on model self-evaluation could hide real problems, much as letting students grade their own papers can miss their actual mistakes.

The signal

Track whether research groups actually adopt peer-evaluation methods in their next model releases, or whether the gaps between model and human judgment remain large enough to keep the approach confined to theory.