The world is being quietly rearranged by people who write very long documents.


The title they went with: The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation

Noisy translates that to:

Researchers built a dataset to catch AI hallucinations about obscure topics — showing even top models fail on less-famous entities


Computer scientists created a benchmark dataset (called RiDiC) designed to measure when AI language models make up false information about real-world entities, particularly less well-known rivers, disasters, and car models. This matters because most AI testing focuses on famous topics where models perform well. This dataset exposes a real failure mode: the less that is written about something online, the more likely an AI is to confidently invent facts about it.
AI companies have mostly evaluated their models on easy questions with well-known answers, which hides a critical weakness: models hallucinate more on obscure topics because they've seen less training data about them. This dataset makes that weakness visible and measurable across multiple languages and domains, which means researchers and companies can now benchmark against a real problem instead of pretending it doesn't exist. It's the difference between testing a medical AI on common diagnoses versus rare diseases — you find the failures that matter most in production.
Track whether major AI labs use this dataset in their standard evaluation reports over the next year, or whether they continue using only the benchmarks that make their models look better.