AI models are still easy to trick, but a new method makes that about 30% harder

What happened

Even advanced safety methods still leave AI models vulnerable to being tricked into giving unsafe answers. Researchers found a way to make these models about 30% harder to trick.

Why it matters

Companies building AI models try to make them safe by teaching them to refuse harmful requests. The research shows that even after advanced training, models can still hide their original unsafe tendencies. The new method helps catch these hidden risks, making the models more reliable.

The signal

Watch for major AI labs to announce they are using this specific technique to make their public models harder to trick.