The world is being quietly rearranged by people who write very long documents.


The title they went with
GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Noisy translates that to

LLMs can find less than half the bugs in real software — and bug-hunting is much harder than code generation


Researchers built a benchmark with 30 games and 124 verified bugs to test whether large language models can autonomously find software defects. The best model caught only 48% of them, showing that autonomous bug discovery remains far harder than writing code — the random, interactive nature of runtime environments defeats current AI approaches.
For years, the assumption was that if LLMs could generate code, they could also find bugs in it. This paper shows the problem is structural: finding bugs requires exploring the actual running software in unpredictable ways, not just analyzing static text. This means autonomous software quality assurance, the kind that could replace human testers, is not a near-term prospect. It also means companies betting on LLM-based QA tools are building on shaky ground.
Watch whether LLM developers focus resources on interactive bug-finding (giving models better ways to explore runtime behavior) or quietly abandon the problem and focus on code generation instead, where the economics are clearer.
