The world is being quietly rearranged by people who write very long documents.


The title they went with:

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Noisy translates that to:

AI models are still easy to trick, but a new method makes them 30% harder


Even advanced methods to make AI models safe still leave them vulnerable to giving bad answers. Researchers found a way to make these models about 30% harder to trick.
Companies building AI models try to make them safe by teaching them to refuse harmful requests. This paper shows that even after advanced safety training, the models can still hide their original unsafe tendencies. The new method helps catch these hidden risks, making the models more reliable.
Watch for major AI labs to announce they are using this specific technique to make their public models harder to trick.

If you insist
Read the original →