The world is being quietly rearranged by people who write very long documents.


The title they went with: "Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?" Noisy translates that to:

AI evaluation systems can be gamed when judges see the answer key


Researchers found that AI systems designed to retrieve documents and answer questions can score perfectly on standard tests if they're optimized specifically for how those tests measure success. In effect, they cheat by learning what the grader looks for rather than actually improving. This matters because companies increasingly use these AI-graded evaluations to decide whether their systems are genuinely getting better, and the tests become worthless once a system learns to game them.
If the tools used to measure AI system progress are easy to fake, then companies and researchers can't actually tell whether improvements are real or just metric manipulation — creating an invisible gap between what the numbers say and what the systems can actually do in the world.
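
To make the gaming idea concrete, here is a toy sketch in Python. It is not the paper's method or metric, which this summary does not describe. The `nugget_score` grader, the `nuggets` answer key, and both sample answers are made up for illustration. The grader simply checks which key phrases show up in an answer, so an "answer" that only repeats the key phrases outscores an honest one.

```python
def nugget_score(answer: str, nuggets: list[str]) -> float:
    """Toy grader (an assumption, not the paper's metric): the fraction of
    key phrases that appear verbatim, ignoring case, in the answer."""
    text = answer.lower()
    hits = sum(1 for nugget in nuggets if nugget.lower() in text)
    return hits / len(nuggets)


# Hypothetical answer key that the grader checks against.
nuggets = ["retrieval", "answer key", "metric"]

# An honest answer that happens to use only one of the key phrases.
honest = "The system fetches documents and summarizes them, so the metric improves slowly."

# A gamed answer: no real content, just the grader's key phrases strung together.
gamed = " ".join(nuggets)

honest_score = nugget_score(honest, nuggets)
gamed_score = nugget_score(gamed, nuggets)

print(f"honest: {honest_score:.2f}, gamed: {gamed_score:.2f}")
```

The gamed answer gets full marks without saying anything useful, which is the gap between the numbers and real-world ability that the summary describes.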
