AI judges used to evaluate other AI are vulnerable to the same attacks they're supposed to catch

What happened

Researchers surveyed 863 papers on using language models to judge the quality of other AI outputs and found a basic problem: the judging AI can be fooled, hacked, or manipulated just like any other system. This means evaluation pipelines that rely on AI to assess AI quality are built on a foundation that hasn't been secured yet.

Why it matters

As AI systems become harder for humans to evaluate by hand, organizations increasingly rely on other AI models to do the rating, scoring whether a response is correct, safe, or good enough. But this creates a single point of failure: if you can trick the judge, you've broken the entire evaluation process. The paper is the first to map out where these vulnerabilities exist and how attackers could exploit them. It also implies that some AI systems reported to have passed safety checks may have been cleared by a judge that was itself open to manipulation.

The signal

Watch whether AI safety evaluations start requiring human spot-checks or adversarial testing of the judges themselves, rather than treating the judge's verdict as final.