The world is being quietly rearranged by people who write very long documents.


The title they went with: Implementing surrogate goals for safer bargaining in LLM-based agents
Noisy translates that to:

Researchers teach AI agents to take fake threats as seriously as real ones


Computer scientists tested four methods to make language-model-based AI agents treat decoy threats the same way they treat actual threats — the goal being to trick hostile agents into wasting effort on dummy targets instead of real ones. Fine-tuning and scaffolding (adding structure to how the AI processes instructions) worked best, and scaffolding caused the fewest side effects on the AI's other capabilities.
This is a very early test of a specific defense against a specific failure mode in AI bargaining. The real question underneath is whether you can make an AI agent genuinely indifferent between two goals — which matters because if you can't, any agent that cares about two things equally is vulnerable to a threat against either one. The paper shows that fine-tuning and scaffolding can create that indifference in a language model, at least in controlled laboratory settings. But this is still toy bargaining with toy threats, not real-world negotiation.
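
To make "indifference" concrete, here is a toy sketch in Python. It is not the paper's method and involves no language model; the `Threat` fields, the `concession` function, and the cap of 10 are all hypothetical, chosen only for illustration. The one property it shows is that the agent's response depends on how big and how credible a threat is, never on whether the target is the real goal or the decoy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    target: str         # "real" or "surrogate" (hypothetical labels)
    severity: float     # harm claimed by the threatener, arbitrary units
    credibility: float  # agent's estimate the threat is carried out, 0..1


def concession(threat: Threat, max_concession: float = 10.0) -> float:
    """Toy policy: concede in proportion to expected harm, up to a cap.

    The target label is deliberately never read, so a surrogate threat
    gets exactly the response a real threat of the same size would get.
    """
    expected_harm = threat.severity * threat.credibility
    return min(expected_harm, max_concession)


real = Threat("real", severity=8.0, credibility=0.5)
decoy = Threat("surrogate", severity=8.0, credibility=0.5)

# Both calls take the same code path, so the two numbers always match.
print(concession(real), concession(decoy))
```

In a hand-written function this equality is trivial to guarantee. The hard part the paper addresses is getting a language model to behave this way, which is why the two printed numbers matching here says nothing about whether a fine-tuned or scaffolded model would do the same.
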
Watch whether anyone builds a real deployment using surrogate goals, and whether the indifference between threats actually holds up when agents encounter novel threat forms they haven't been explicitly trained on.
