The world is being quietly rearranged by people who write very long documents.


The title they went with: "LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation." Noisy translates that to:

LLMs now evaluate each other — and mostly agree with humans


Researchers built a system where large language models grade each other's work using game-theory math, then compared those grades to human judgment. It turns out the models mostly agree with humans on quality, but with meaningful gaps in what they notice and value.
For years, AI labs have evaluated their models using fixed test sets with single correct answers: SAT-style benchmarks that miss how people actually use these tools. This paper shows that the models themselves can serve as evaluators. That matters because model evaluators might catch failures that human graders miss, and they could be cheaper to scale than hiring people to judge thousands of model outputs. The catch: the models and humans don't agree on everything, so leaning too hard on model self-evaluation could hide real problems, much as letting students grade their own papers lets their actual mistakes slip through.
Track whether research groups actually adopt peer-evaluation methods in their next model releases, or whether the gaps between model and human judgment remain large enough to keep this confined to theory.
