Researchers learn to control which AI hallucinations humans can catch

What happened

A team built a dataset showing that some AI errors are easy to spot while others are nearly invisible, then demonstrated that they could steer an AI system to produce more of the obvious errors and fewer of the hidden ones. This gives AI safety engineers a way to trade off usability against detectability: where accuracy matters, push the system so that its mistakes are easy to catch; where the application tolerates some error, accept mistakes that are harder to detect.

Why it matters

AI hallucinations have usually been treated as a binary problem: either the system hallucinates or it doesn't. This paper shows that hallucinations exist on a spectrum of verifiability, and that you can control where on that spectrum the system operates. That's a shift from 'fix hallucinations' to 'choose what kind of hallucination your users will encounter.' The practical implication is direct: a high-stakes application like medical diagnosis can require that any errors be the obvious kind (so doctors catch them), while a low-stakes application like recommendation summaries can tolerate more elusive errors without imposing the same verification burden.

The intervention method works by identifying which parts of the AI's internal processing are associated with detectable versus undetectable errors. In effect, it learns how the system's internal state differs when it makes an obvious mistake versus a hidden one, then uses that difference to steer the output.
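
The idea of separating obvious from hidden errors inside the model's internal processing can be illustrated with a toy example. This summary does not give the paper's exact procedure, so what follows is only a sketch under stated assumptions: it uses a difference-of-means steering vector, a common activation-steering technique swapped in here for illustration, and it runs on synthetic random vectors rather than real model activations. Every name in it (steering_direction, steer, alpha, DIM) is hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(obvious_acts, elusive_acts):
    """Unit vector pointing from the elusive-error mean to the obvious-error mean."""
    mu_obvious = mean_vector(obvious_acts)
    mu_elusive = mean_vector(elusive_acts)
    diff = [a - b for a, b in zip(mu_obvious, mu_elusive)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0:
        raise ValueError("the two groups have identical mean activations")
    return [d / norm for d in diff]


def project(vec, direction):
    """Scalar position of vec along the (unit-length) direction."""
    return sum(v * d for v, d in zip(vec, direction))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha units along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Synthetic stand-in for real activations (an assumption for illustration):
# the two error types differ only along coordinate 0, plus Gaussian noise.
rng = random.Random(0)
DIM = 8


def fake_activation(shift):
    return [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(DIM)]


obvious = [fake_activation(+2.0) for _ in range(200)]
elusive = [fake_activation(-2.0) for _ in range(200)]

direction = steering_direction(obvious, elusive)

# Take one "elusive" state and push it toward the "obvious" side.
hidden = elusive[0]
steered = steer(hidden, direction, alpha=4.0)

print("projection before:", round(project(hidden, direction), 3))
print("projection after: ", round(project(steered, direction), 3))
```

Because the direction is normalized, alpha directly sets how far the state moves along it, which is the knob a team would tune when deciding how strongly to favor catchable errors. Whether a single linear direction like this captures the distinction in a real model is exactly the kind of assumption that would need testing.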

The signal

The open question is whether safety teams will actually adopt this approach in production systems, or whether the activation-space probes will prove fragile when scaled beyond the 4,470 examples the researchers tested on.