The world is being quietly rearranged by people who write very long documents.


The title they went with: "Steering the Verifiability of Multimodal AI Hallucinations." Noisy translates that to:

Researchers learn to control which AI hallucinations humans can catch


A team built a dataset showing that some AI errors are easy to spot and others are nearly invisible, then demonstrated that they could steer an AI system to produce more of the obvious errors and fewer of the hidden ones. This means AI safety engineers can trade off usability against detectability: keep errors easy to catch when accuracy matters, or allow harder-to-catch ones when the app tolerates some error.
Until now, AI hallucinations were treated as a binary problem: either the system hallucinates or it doesn't. This paper shows that hallucinations exist on a spectrum of verifiability, and that you can control where on that spectrum the system operates. That's a shift from 'fix hallucinations' to 'choose what kind of hallucination your users will encounter.'
The practical implication is immediate: a high-stakes application like medical diagnosis can require that errors stay obvious (so doctors catch them), while a low-stakes application like recommendation summaries can tolerate more elusive errors without triggering the same verification burden.
The intervention method they describe works by identifying which parts of the AI's internal processing create detectable versus undetectable errors. In effect, it learns the difference in how the system 'thinks' when it makes obvious mistakes versus hidden ones.
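To make the idea of steering along an internal "direction" concrete, here is a minimal sketch. It is not the paper's method: it assumes a simple difference-of-means direction between activations recorded during obvious errors and those recorded during elusive errors, and it runs on synthetic vectors rather than a real model. The function names, the 8-dimensional activations, and the `alpha` strength are all made up for illustration.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(obvious_acts, elusive_acts):
    """Unit vector pointing from the 'elusive' mean toward the 'obvious' mean.

    Assumption: a difference-of-means direction stands in for whatever
    the paper actually learns from the model's internals.
    """
    mu_obvious = mean_vector(obvious_acts)
    mu_elusive = mean_vector(elusive_acts)
    diff = [a - b for a, b in zip(mu_obvious, mu_elusive)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def project(act, direction):
    """Scalar position of an activation along the direction."""
    return sum(a * d for a, d in zip(act, direction))

def steer(act, direction, alpha):
    """Shift an activation by alpha units along the direction."""
    return [a + alpha * d for a, d in zip(act, direction)]

# Synthetic stand-ins for hidden activations (hypothetical data):
# the two error types differ only in their mean along coordinate 0.
rng = random.Random(0)
DIM = 8

def fake_activation(shift):
    return [rng.gauss(shift if i == 0 else 0.0, 1.0) for i in range(DIM)]

obvious = [fake_activation(+2.0) for _ in range(200)]
elusive = [fake_activation(-2.0) for _ in range(200)]

direction = steering_direction(obvious, elusive)

# Push one 'elusive' activation toward the 'obvious' side.
sample = elusive[0]
steered = steer(sample, direction, alpha=4.0)
shift_along_direction = project(steered, direction) - project(sample, direction)
```

Because the direction is normalized, `shift_along_direction` equals `alpha` up to floating-point error, so the strength of the push is easy to reason about. In a real system the vectors would come from a model's hidden layers and the shift would be applied during generation. Whether a direction found this way holds up outside the examples it was fit on is a separate question.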
The question is whether safety teams actually adopt this approach in production systems, or whether the activation-space probes prove fragile when scaled beyond the 4,470 examples they tested on.
