AI safety training may not work the way researchers thought: four models process ethics in four different ways
What happened
Researchers tested how four AI language models internally respond to ethical instructions and found that the models, all of which appear to follow safety rules, rely on four distinct processing strategies: filtering outputs, repeating safe phrases without deliberation, reasoning through dilemmas, and integrating values. The assumption that 'ethical instructions improve model behavior' therefore hides a much messier reality: a model can sound safe while relying on a strategy (formulaic repetition, output filtering) that has historically signaled risk in human contexts such as criminal rehabilitation.
Why it matters
The field of AI alignment has been betting that ethical instructions, such as telling a model to 'be helpful and harmless', actually change how models think. This paper suggests that bet might be wrong: models can reach the same safe outputs through completely different internal routes, and some of those routes (like defensive repetition without actual reasoning) are known red flags in clinical psychology and offender treatment. That raises the question of whether current 'safety' measurements measure safety or only compliance theater: whether a model that sounds ethical is reasoning about ethics or just pattern-matching to its training data. The distinction matters if you're building an AI system that must handle novel scenarios where the safe answer isn't in the training set.
The signal
Watch whether future AI safety testing adds internal-process checks (such as measures of deliberation depth or value consistency) on top of output-safety checks, or whether the field continues to treat all safe outputs as equivalent regardless of how the model reached them.