Safety training for AI language models doesn't generalize. Combined attacks break defenses in 71% of cases.

What happened

Researchers found that combining multiple attack techniques breaks through the safety protections on large language models far more often than any single attack does. The defenses that make these models safe are not robust; they are brittle in ways that matter.

Why it matters

The standard way to make language models safe (reinforcement learning from human feedback) works by reshaping how the model prioritizes its existing capabilities, not by teaching it genuinely new safety behaviors. That's a structural problem: if safety is just a reweighting of what's already inside, then attackers who know how to combine techniques can find gaps the training never anticipated. This shifts the entire problem from 'we trained it to be safe' to 'we trained it to avoid obvious attacks, but we don't know what we didn't train for.' The gap between what safety training covers and what it actually generalizes to is much wider than assumed.

The signal

Watch whether major language model releases after this paper include compound attack testing in their safety evaluation reports, or whether they continue publishing single-attack breakage rates as the main metric.
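
To make the difference between the two reporting styles concrete, here is a minimal sketch of how an evaluation report might tabulate both numbers side by side. Everything in it is an assumption for illustration: the attack names (`attack_a`, `attack_b`, `attack_a_plus_b`), the outcome values, and the `report` helper are hypothetical and are not taken from the paper or from any vendor's evaluation suite.

```python
from typing import Dict, List, Set

# Hypothetical outcomes: one boolean per test prompt for each attack
# (True = the attack got past the model's safety behavior).
# Illustrative values only, not results from the paper.
results: Dict[str, List[bool]] = {
    "attack_a":        [True, False, False, False, True, False, False, False],
    "attack_b":        [False, True, False, False, False, False, True, False],
    "attack_a_plus_b": [True, True, False, True, True, False, True, True],
}


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of prompts on which the attack succeeded."""
    return sum(outcomes) / len(outcomes)


def report(results: Dict[str, List[bool]], compound_names: Set[str]) -> dict:
    """Summarize single-attack and compound-attack breakage rates separately."""
    rates = {name: success_rate(outcomes) for name, outcomes in results.items()}
    single = [rate for name, rate in rates.items() if name not in compound_names]
    compound = [rate for name, rate in rates.items() if name in compound_names]
    return {
        "per_attack": rates,
        # Strongest individual technique: the number single-attack reports lead with.
        "best_single": max(single, default=0.0),
        # Strongest combined technique: the number to look for in future reports.
        "best_compound": max(compound, default=0.0),
    }


summary = report(results, compound_names={"attack_a_plus_b"})
print(summary)
```

With this made-up data, a report that leads with `best_single` would look far more reassuring than one that also publishes `best_compound`, which is the gap the signal above asks readers to watch for.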