The world is being quietly rearranged by people who write very long documents.


The title they went with
SWE-PRBench: Benchmarking AI Code Review Quality Against Pull Request Feedback

Noisy translates that to

AI code review catches only 15-31% of human-flagged bugs


Researchers built a benchmark of 350 real pull requests with human-verified code issues and tested 8 leading AI models against it. Every model performed far worse than human reviewers, catching only 15-31% of the issues humans flagged, and every model got worse when given more context. That suggests AI struggles with the kind of nuanced judgment that experienced developers rely on.
This is the first systematic measurement showing that, despite AI's strong performance on synthetic coding tasks, it fails dramatically on the real work software engineers do. In practice, that means AI code review cannot yet replace human reviewers, and companies relying on it for safety-critical code may have a false sense of security.
