Smart AI models cannot reliably evaluate other AI models without specific training
What happened
Capable general-purpose AI models are unreliable at checking the work of other AI models. Without training specific to evaluation, they miss problems and produce confusing reports.
Why it matters
It was widely assumed that a smart AI could check another AI's work. This paper shows that the assumption does not hold: general capability is not a substitute for evaluation skill, and models need specific evaluation training to do the job right.
The signal
Watch for new tools that build in specific evaluation expertise, or for companies hiring more human experts to check AI agent performance.