Researchers build AI defense that learns from repeated attacks — 79% success rate blocking them
What happened
A team created a system where multiple AI agents work together to defend language models against attacks that get smarter across multiple rounds of interaction. Instead of blocking attacks the same way each time, the system remembers what happened before and adapts its defenses, reducing successful attacks by nearly 80% compared to existing defenses.
Why it matters
Language models are now being deployed in real systems where adversaries can attack them repeatedly, refining their approach each round — like a hacker probing a bank vault multiple times. Previous defenses were static, designed to block one attack the same way every time, which doesn't work when attackers learn and shift tactics. This research shows that defenses can themselves learn and adapt across rounds, which means the cost of attacking a deployed language model just went up significantly.
The signal
Watch whether this defense gets tested against adversaries who know the defense exists and design attacks specifically to evade it — the real test is not whether it blocks known attacks, but whether it holds against attackers who see it coming.