Why it matters
Companies building AI models try to make them safe by training them to refuse harmful requests. This paper shows that even after advanced training, models can still hide their original unsafe tendencies. The new method helps catch these hidden risks, which makes the models more reliable.