New benchmark reveals how poorly AI handles full codebases

What happened

Researchers created ReCUBE, a benchmark that measures whether large language models can use the context of an entire software repository to write correct code. The results show that even the best models fail most of the time on this task, which suggests that AI coding assistants, despite the claims made for them, remain quite limited when working with real-world, interconnected code.

Why it matters

This is the first benchmark to isolate what AI models actually struggle with in real coding: understanding how different files depend on and call each other across a whole project. Until now, benchmarks tested individual coding tasks in isolation, which hid the gap between 'can write a function' and 'can write code that integrates with a 50-file codebase.'