The world is being quietly rearranged by people who write very long documents.


The title they went with: "Generalization Limits of Reinforcement Learning Alignment"

Noisy translates that to:

Safety training for AI language models doesn't generalize. Combined attacks break defenses in 71% of cases.


Researchers found that combining multiple attack techniques breaks through the safety protections on large language models far more often than any single attack does. The defenses that make these models safe aren't robust; they're brittle in ways that matter.
The standard way to make language models safe (reinforcement learning from human feedback) works by reshaping how the model prioritizes its existing capabilities, not by teaching it genuinely new safety behaviors. That's a structural problem: if safety is just a reweighting of what's already inside, then attackers who know how to combine techniques can find gaps the training never anticipated. This shifts the entire problem from 'we trained it to be safe' to 'we trained it to avoid obvious attacks, but we don't know what we didn't train for.' The gap between what safety training covers and what it actually generalizes to is much wider than assumed.
Watch whether major language model releases after this paper include compound attack testing in their safety evaluation reports, or whether they continue publishing single-attack breakage rates as the main metric.
