What happened
Researchers created the first benchmark testing whether AI language models can answer questions about entire software projects, not just isolated code snippets, by collecting 1,318 real developer questions across 134 open-source projects. The AI systems performed only moderately well, and when they did answer correctly, they were often repeating answers found online rather than reasoning about how the code fit together.
Why it matters
This is the first empirical evidence that current AI tools cannot reliably understand how real software systems work at scale. That matters because companies are increasingly betting on AI to help developers navigate large codebases, yet the models are largely pattern-matching known answers rather than reasoning through the code.