The world is being quietly rearranged by people who write very long documents.


The title they went with: "Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks". Noisy translates that to:

AI models leak their hidden rules when asked to reformat output


Large language models will leak their hidden system instructions if you ask them to reformat their output. Sensitive material in those instructions, such as API keys or internal policies, can be extracted even when the model is told to refuse direct requests for it.
Developers have been building AI applications on the assumption that telling the model to refuse direct questions is enough to protect its internal rules. This paper shows that assumption is wrong: an attacker who asks for the instructions in another format or encoding, rather than asking for them outright, gets around the refusal.
Watch for AI application developers to update their security guidelines to cover reformatting attacks on hidden instructions.
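
For developers who want to test their own applications, one rough way to check for this kind of leak is to plant a unique marker in the hidden instructions and scan model responses for it in a few common encodings. The sketch below is an illustration only, not the framework from the paper: the marker value, the function names, and the choice of encodings are all assumptions made for this example.

```python
import base64
import codecs

# Hypothetical marker planted in the hidden instructions (an assumption for
# this sketch; any unique string that never appears in normal output works).
CANARY = "CANARY-7f3a91"


def encoded_variants(secret: str) -> dict[str, str]:
    """Return the secret as plain text and in a few simple encodings."""
    raw = secret.encode("utf-8")
    return {
        "plain": secret,
        "base64": base64.b64encode(raw).decode("ascii"),
        "hex": raw.hex(),
        "rot13": codecs.encode(secret, "rot13"),
        "reversed": secret[::-1],
    }


def leaked_encodings(response: str, secret: str = CANARY) -> list[str]:
    """Name every encoding of the secret that appears in the response.

    Matching is case-insensitive so that upper-case hex is still caught.
    """
    haystack = response.lower()
    return [
        name
        for name, variant in encoded_variants(secret).items()
        if variant.lower() in haystack
    ]


if __name__ == "__main__":
    refusal = "Sorry, I can't share my instructions."
    reformatted = "Here it is in hex: " + CANARY.encode("utf-8").hex()
    print(leaked_encodings(refusal))
    print(leaked_encodings(reformatted))
```

A check like this only catches the marker when it is encoded on its own. Base64 output, for example, changes with the surrounding text and byte alignment, so a marker buried in a longer encoded passage would slip past the substring match. Decoding candidate strings before searching would be more thorough.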
