What happened
Researchers built a benchmark of 350 real pull requests with human-verified code issues and tested 8 leading AI models against it. Every model performed far worse than human reviewers, catching only a fraction of the actual problems, and performance dropped further when the models were given more context. The result suggests that AI struggles with the kind of nuanced judgment experienced developers rely on.
Why it matters
This is the first systematic measurement showing that, despite strong performance on synthetic coding tasks, AI fails dramatically on the real work software engineers do. AI code review therefore cannot yet replace human reviewers, and companies relying on it for safety-critical code may have a false sense of security.