The world is being quietly rearranged by people who write very long documents.


The title they went with:

ReCUBE: Evaluating Repository-Level Context Utilization in Code Generation

Noisy translates that to:

New benchmark reveals how poorly AI handles full codebases


Researchers created ReCUBE, a benchmark that measures whether large language models can actually use the context of an entire software repository to write correct code. Even the best AI models fail most of the time on this task, which suggests that, despite the claims made for AI coding assistants, they remain quite limited when dealing with real-world, interconnected code.
This is the first measurement that isolates what AI actually struggles with in real coding: understanding how different files depend on and call each other across a whole project. Until now, benchmarks tested individual coding tasks in isolation, which hid the gap between "can write a function" and "can write code that integrates with a 50-file codebase."

If you insist
Read the original →