Researchers build a way to check if AI judges are actually consistent — and find that asking them to explain themselves helps

What happened

A team developed a method that breaks down how AI language models evaluate text into explicit steps and measures their confidence at each step, rather than just asking them for a single score. When tested, the step-by-step approach with uncertainty tracking produced more stable and consistent judgments than direct scoring, especially when comparing two outputs where the right answer wasn't obvious.
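To make the idea concrete, here is a minimal sketch of criterion-by-criterion judging with a confidence attached to each step. It is not the paper's implementation: the class name `CriterionJudgment`, the functions `aggregate` and `compare`, the 0-to-1 scales, the confidence-weighted average, and the tie margin are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CriterionJudgment:
    """One step of a structured evaluation (hypothetical structure)."""
    criterion: str     # e.g. "accuracy" or "clarity"
    score: float       # assumed scale: 0.0 (worst) to 1.0 (best)
    confidence: float  # assumed scale: 0.0 (guessing) to 1.0 (certain)


def aggregate(judgments):
    """Combine per-criterion scores into (overall score, mean confidence).

    Assumption: scores are averaged with confidence as the weight, so
    criteria the judge is unsure about move the result less.
    """
    total_conf = sum(j.confidence for j in judgments)
    if total_conf == 0:
        raise ValueError("no criterion has non-zero confidence")
    score = sum(j.score * j.confidence for j in judgments) / total_conf
    return score, total_conf / len(judgments)


def compare(a, b, margin=0.05):
    """Pairwise verdict; differences smaller than `margin` count as a tie."""
    score_a, _ = aggregate(a)
    score_b, _ = aggregate(b)
    if abs(score_a - score_b) < margin:
        return "tie"
    return "A" if score_a > score_b else "B"


output_a = [CriterionJudgment("accuracy", 0.8, 1.0),
            CriterionJudgment("clarity", 0.4, 0.5)]
output_b = [CriterionJudgment("accuracy", 0.5, 1.0),
            CriterionJudgment("clarity", 0.9, 0.5)]
verdict = compare(output_a, output_b)
```

In this sketch a close call is reported as a tie rather than forced into a winner, which is one simple way a judge could "admit uncertainty" when comparing two outputs.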

Why it matters

AI systems are now used to grade everything from student work to medical summaries to legal documents, yet there is little evidence on whether they judge fairly or consistently. This paper shows that having an AI break its reasoning into explicit criteria and state its uncertainty yields more stable judgments than asking it for a single score. The implication is uncomfortable: if consistency improves just by changing how the question is asked, many AI evaluation systems in production are likely less reliable than they could be.

The signal

Watch whether AI evaluation products for code review, content moderation, and legal research start implementing structured scoring with confidence bounds instead of single scores, and whether those changes measurably reduce disagreement when the same AI re-evaluates the same output weeks later.
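One simple way to quantify that kind of re-evaluation disagreement is the share of items whose verdict changes between two runs. The sketch below assumes verdicts are stored as labels in the same item order for both runs; the function name `disagreement_rate` and the sample labels are hypothetical, not taken from the paper.

```python
def disagreement_rate(first_run, second_run):
    """Fraction of items whose verdict differs between two evaluation runs.

    Assumption: both lists hold one verdict label per item, in the same order.
    """
    if len(first_run) != len(second_run):
        raise ValueError("runs must cover the same items")
    if not first_run:
        raise ValueError("runs must not be empty")
    flips = sum(1 for x, y in zip(first_run, second_run) if x != y)
    return flips / len(first_run)


# Illustrative labels only: verdicts on four items, re-judged later.
week_0 = ["A", "B", "tie", "A"]
week_3 = ["A", "A", "tie", "B"]
rate = disagreement_rate(week_0, week_3)  # two of four verdicts changed
```

A product that moved from single scores to structured scoring could track this number before and after the change to see whether the judge's verdicts actually became more stable.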