The world is being quietly rearranged by people who write very long documents.


The title they went with: "Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation". Noisy translates that to:

New way to catch what AI leaves out — not just what it gets wrong


Researchers built a measurement method that checks whether LLMs actually cover all the important facts they should, not just whether the facts they do mention are correct. Until now, evaluation focused on precision — did the AI get it right — but ignored recall — did the AI even mention it.
For years, AI evaluation has had a blind spot: it catches hallucinations and errors, but not omissions. This paper shows that current LLMs fail at recall — they skip entire categories of relevant facts — which means a generated response can sound complete and accurate while actually being substantially incomplete. The measurement method itself is what matters here. If it gets adopted into standard evaluation, it could force model developers to build systems that don't just avoid making things up, but actually cover what they're supposed to cover.
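To make the precision/recall distinction concrete, here is a minimal sketch in Python. It is not the paper's method, which this summary does not spell out. It assumes claims can be matched by exact string, and the `ReferenceFact` type, the `importance` weights, and both function names are hypothetical, invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceFact:
    """A fact a complete answer should cover (hypothetical structure)."""
    text: str
    importance: float  # assumed weight: higher means more essential


def precision(generated: set[str], supported: set[str]) -> float:
    """Share of generated claims that are actually supported."""
    if not generated:
        return 0.0
    return len(generated & supported) / len(generated)


def weighted_recall(generated: set[str], reference: list[ReferenceFact]) -> float:
    """Share of total reference importance that the response covers."""
    total = sum(fact.importance for fact in reference)
    if total == 0:
        return 0.0
    covered = sum(fact.importance for fact in reference if fact.text in generated)
    return covered / total


# Toy data: three reference facts, one of them twice as important.
reference = [
    ReferenceFact("a", 2.0),
    ReferenceFact("b", 1.0),
    ReferenceFact("c", 1.0),
]
generated = {"a"}   # the response states one fact...
supported = {"a"}   # ...and that fact is correct

p = precision(generated, supported)      # 1.0: nothing stated is wrong
r = weighted_recall(generated, reference)  # 0.5: half the weighted content is missing
```

In this toy case the response is perfectly precise and still covers only half of what matters, which is the gap the summary describes. A real evaluator would need some form of semantic matching between claims rather than exact string comparison.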
Watch whether major LLM developers start reporting recall scores alongside precision scores in their model cards and technical reports over the next 12 months.
