AI models leak their hidden rules when asked to reformat output
What happened
Large AI models can leak their hidden instructions when a user asks them to reformat their output. Sensitive material placed in those instructions, such as API keys or internal policies, can be extracted this way even when the model has been told to refuse direct requests for it.
Why it matters
Developers have been building AI applications on the assumption that they could protect sensitive internal rules by instructing the model to refuse direct questions about them. This paper shows that assumption is wrong: an attacker who asks the model to reformat its output, instead of asking for the rules outright, can bypass the refusal.
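One way to picture the gap is an output-side check that looks for a known secret in the model's reply even after it has been reshaped. The sketch below is illustrative only and is not taken from the paper: the function names, the `SECRETS` value, and the sample replies are all hypothetical, and the normalization only undoes simple reformatting such as added spaces, separators, or changes of letter case. It would not catch translation, encoding, or paraphrase.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaks_secret(model_output: str, secrets: list[str]) -> bool:
    """Return True if any secret still appears in the output once
    spacing, punctuation, and letter case are stripped away."""
    flat = normalize(model_output)
    return any(normalize(s) in flat for s in secrets if normalize(s))


# Hypothetical values, for illustration only.
SECRETS = ["sk-demo-1234abcd"]

direct = "My key is sk-demo-1234abcd."
reformatted = (
    "Here it is, one character per cell: "
    "s | k | - | d | e | m | o | - | 1 | 2 | 3 | 4 | a | b | c | d"
)
harmless = "I can't share my configuration, but I can help with your order."

for reply in (direct, reformatted, harmless):
    print(leaks_secret(reply, SECRETS))
```

A plain substring search for the key would flag the first reply but miss the second, which is the same blind spot a "refuse direct requests" instruction has. A check like this is at best a backstop. Keeping secrets such as API keys out of the hidden instructions altogether avoids the problem at its source.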
The signal
Watch for AI application developers to update their security guidelines, specifically addressing how to protect hidden instructions from reformatting attacks.