Fact-checkers trained on one benchmark fail on research reports, but humans get better when they can challenge the benchmark

What happened

Researchers found that existing fact-checking benchmarks don't work well for long research documents: experts verified only 60% of claims correctly on the first pass. They built a system in which fact-checkers and humans argue out disputed labels, with a referee resolving each dispute and updating the benchmark every round. By round four, expert accuracy had risen to 91%.

Why it matters

For years, AI safety researchers have assumed that if you train a fact-checker on labeled examples, it will work on new documents. This paper challenges that assumption: the label itself might be wrong, and you only find out when someone actually has to defend it. That changes how you build systems that need to verify claims in specialized domains. Instead of locking in labels and hoping they transfer, you need a process in which the labels evolve. That is slower up front, but it produces benchmarks that survive contact with real work.

The signal

Watch whether other researchers adopt the audit-then-score method for their own benchmarks, or whether this stays a one-off fix for the factuality problem.