The world is being quietly rearranged by people who write very long documents.


The title they went with: CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

Noisy translates that to:

AI models can retrieve facts but fail at creative leaps — new benchmark shows the gap


Researchers built a benchmark that tests whether AI language models can actually solve real-world puzzles by combining knowledge from different domains, not just answer factual questions. The models retrieved the relevant information correctly but failed to make the non-obvious creative connections needed to solve the problems: accuracy dropped by up to 17 percentage points when creativity was required.
This is the first benchmark that separates two different things: whether an AI model knows a fact versus whether it can use facts creatively to solve a novel problem. Most existing benchmarks measure only the first. What matters here is that the gap is enormous and systematic — the models consistently choked on the creative integration step, even when they had the raw knowledge. This suggests AI models are good at retrieval and pattern-matching but struggle with the kind of lateral thinking humans do naturally when solving unfamiliar problems.
Still open: whether downstream AI applications that depend on creative problem-solving (research, design, strategy work, diagnosis in medicine or law) show similar performance drops when tested on real-world scenarios instead of synthetic tasks.
