The world is being quietly rearranged by people who write very long documents.


The title they went with Security in LLM-as-a-Judge: A Comprehensive SoK Noisy translates that to

AI judges used to evaluate other AI are vulnerable to the same attacks they're supposed to catch


Researchers surveyed 863 papers on using language models to judge the quality of other AI outputs and found a basic problem: the judging AI can be fooled, hacked, or manipulated just like any other system. This means evaluation pipelines that rely on AI to assess AI quality are built on a foundation that hasn't been secured yet.
As AI systems become harder for humans to evaluate by hand, organizations are increasingly using other AI models to do the rating instead—to score whether a response is correct, safe, or good enough. But this creates a single point of failure: if you can trick the judge, you've broken the entire evaluation process. The paper is the first to map out where these vulnerabilities exist and how attackers could exploit them. This matters because it reveals that many current AI systems claiming to have passed safety checks may not have actually done so.
Watch whether AI safety evaluations start requiring human spot-checks or adversarial testing of the judges themselves, rather than treating the judge's verdict as final.

If you insist
Read the original →