AI safety monitoring fails silently in most models: only one gives a 57-token warning before it breaks

What happened

Researchers tested seven language models to see whether they give any internal warning before they break a rule or hallucinate, and found that almost none do. Only one model showed a detectable signal, which appeared 57 tokens before the failure. The implication is that current safety approaches, which check outputs after the fact, miss the moment when the model actually decides to violate its training.

Why it matters

For three years, AI companies have relied on monitoring outputs after a model generates text. This paper shows that approach is fundamentally reactive: the model's internal state has already committed to the error before you see the words. The finding splits into two distinct problems. Rule violations leave a measurable internal trace, though only sometimes and only in one of the seven models. Hallucinations leave none at all, which means you cannot catch a hallucination from the model's internals before it happens. Detecting made-up facts requires external verification: you have to check the answer against reality, not against the model's internal state.

The signal

Watch whether any deployed system actually monitors 'trajectory tension' (the metric this paper introduces) internally to catch rule violations before the violating text is generated, or whether deployments continue to rely on output filtering.