LLMs can find fewer than half the bugs in real software, and bug-hunting is much harder than code generation

What happened

Researchers built a benchmark with 30 games and 124 verified bugs to test whether large language models can autonomously find software defects. The best model caught only 48% of them. The result suggests that autonomous bug discovery remains far harder than writing code: the random, interactive nature of runtime environments defeats current AI approaches.

Why it matters

For years, the assumption was that if LLMs could generate code, they could also find bugs in it. This paper argues the problem is structural: finding bugs requires exploring the actual running software in unpredictable ways, not just analyzing static text. That puts autonomous software quality assurance, the kind that could replace human testers, out of reach as a near-term prospect. It also means companies betting on LLM-based QA tools are building on shaky ground.
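
To make the gap between reading code and running it concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the toy board, the `step` function, the planted `jump` bug, and the `find_failing_sequence` helper are hypothetical and are not taken from the benchmark or the paper. The defect only surfaces once a particular sequence of actions drives the game into a bad state, so the tester has to execute the game and check an invariant after every move.

```python
from itertools import product

BOARD_SIZE = 5  # hypothetical toy board with cells 0..4
ACTIONS = ("left", "right", "jump")


def step(position: int, action: str) -> int:
    """Advance the toy game by one action and return the new position."""
    if action == "left":
        return max(position - 1, 0)
    if action == "right":
        return min(position + 1, BOARD_SIZE - 1)
    # Planted bug (for illustration): "jump" moves two cells
    # but is never clamped to the board edge.
    return position + 2


def find_failing_sequence(max_depth: int):
    """Run action sequences shortest-first; return the first one that
    moves the player off the board, or None if none is found."""
    for depth in range(1, max_depth + 1):
        for actions in product(ACTIONS, repeat=depth):
            position = 0
            for action in actions:
                position = step(position, action)
                # Invariant checked at runtime: the player stays on the board.
                if not 0 <= position < BOARD_SIZE:
                    return actions
    return None


if __name__ == "__main__":
    print(find_failing_sequence(3))
```

In this toy case an exhaustive search over three actions is cheap, and a careful reader could spot the bug without running anything. The games in a benchmark like this one have far larger state spaces and randomness, which is where exhaustive search stops being an option and the exploration strategy itself becomes the hard part.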

The signal

Watch whether LLM developers invest in interactive bug-finding (giving models better ways to explore runtime behavior) or quietly abandon the problem in favor of code generation, where the economics are clearer.