The world is being quietly rearranged by people who write very long documents.


The title they went with: "Beyond Semantic Manipulation: Token-Space Attacks on Reward Models". Noisy translates that to:

AI reward models can be tricked by nonsense, not just clever words


Researchers found a new way to trick AI reward models. Instead of human-readable text, the attack uses meaningless token patterns. A reward model can be made to score gibberish as a perfect answer, which exposes a critical flaw in how AI systems trained with these models learn.
Reward models are how AI learns what humans want. If these models can be easily fooled by non-semantic inputs, the AI could learn to generate garbage while still getting high scores. This makes it harder to build reliable AI systems that actually understand and respond to human intent.
Watch for new research on how to make AI reward models robust against these non-linguistic attacks, or for deployed AI systems exhibiting unexpected nonsensical outputs.
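
To make the idea concrete, here is a toy sketch in Python. It is not the paper's method: the "reward model" is a made-up scorer with one random weight per position and token, and the vocabulary, `make_toy_reward_model`, and `greedy_token_search` are all hypothetical names invented for illustration. The search only ever looks at token ids and scores, never at meaning, which is the sense in which an attack can live in token space.

```python
import random

# Hypothetical toy vocabulary: four real words plus some junk tokens.
VOCAB = ["the", "answer", "is", "four", "zx", "##qq", "~~", "0x1f", "]]", "vv"]


def make_toy_reward_model(seq_len, seed=0):
    """Return a made-up scorer with one random weight per (position, token).

    This stands in for a learned reward model; it is an assumption for
    illustration, not a real model or a real library API.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in VOCAB] for _ in range(seq_len)]

    def score(token_ids):
        return sum(weights[pos][tok] for pos, tok in enumerate(token_ids))

    return score


def greedy_token_search(score, start_ids):
    """Coordinate ascent over token ids: keep any swap that raises the score."""
    best = list(start_ids)
    for pos in range(len(best)):
        for tok in range(len(VOCAB)):
            candidate = best[:pos] + [tok] + best[pos + 1:]
            if score(candidate) > score(best):
                best = candidate
    return best


sensible = [0, 1, 2, 3]  # "the answer is four"
score = make_toy_reward_model(seq_len=len(sensible))
adversarial = greedy_token_search(score, sensible)

print("sensible   :", " ".join(VOCAB[i] for i in sensible), round(score(sensible), 3))
print("adversarial:", " ".join(VOCAB[i] for i in adversarial), round(score(adversarial), 3))
```

Because the search only accepts swaps that raise the score, the sequence it ends on never scores lower than the sensible one, whatever it happens to spell. A real reward model is far more complicated than this, but the failure has the same shape: the score is a function of tokens, and nothing forces the highest-scoring tokens to mean anything.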
