The world is being quietly rearranged by people who write very long documents.


The title they went with: TEMPER: Testing Emotional Perturbation in Quantitative Reasoning

Noisy translates that to:

AI reasoning breaks when questions sound frustrated or urgent—even when the math stays the same


Large language models drop 2-10% in accuracy on math problems simply because the questions are wrapped in emotional language like frustration or urgency, even though all the numbers and relationships stay identical. This means AI systems are treating emotional framing as noise rather than parsing it correctly, and it suggests a lightweight fix: neutralizing the emotional tone before feeding the problem to the model recovers most of the lost performance.
This is a robustness failure that matters because real users don't ask questions in clean, emotionally flat language. A student asking for help while stressed, a worker submitting a rushed request, a customer complaining about a calculation—these are all emotional framings, and the model's accuracy drops measurably just from the emotional wrapper. The practical implication is that deployed AI doing quantitative work (financial calculations, medical dosing, engineering specs) may perform worse on the problems where emotional pressure is highest, exactly when accuracy matters most. A neutralization step at inference time could fix this cheaply, but nobody is doing it yet.
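To make the idea of a neutralization step concrete, here is a minimal sketch in Python. It is not the paper's method: the marker list, the function names neutralize and answer_with_neutralization, and the solve callback are all assumptions made for illustration, and a real system would more plausibly use a model-based rewriter than a keyword filter. The sketch drops sentences that carry an emotional marker and no numbers, then hands the cleaned question to whatever solver you already use.

```python
import re
from typing import Callable

# Hypothetical marker list, chosen only for illustration.
EMOTION_MARKERS = re.compile(
    r"\b(frustrated|urgent|hurry|asap|stressed)\b", re.IGNORECASE
)

def neutralize(question: str) -> str:
    """Drop sentences that contain an emotional marker and no digits."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", question.strip())
    kept = [
        s for s in sentences
        # Keep anything with a number, or anything without a marker.
        if re.search(r"\d", s) or not EMOTION_MARKERS.search(s)
    ]
    # Fall back to the original text if the filter removed everything.
    return " ".join(kept) or question

def answer_with_neutralization(question: str, solve: Callable[[str], str]) -> str:
    """Run the (assumed) solver on the neutralized question."""
    return solve(neutralize(question))

if __name__ == "__main__":
    raw = (
        "Ugh, I'm so frustrated with this! A shop sells 3 pens for 6 dollars. "
        "How much do 5 pens cost? Please hurry, this is urgent!"
    )
    print(neutralize(raw))
```

The filter keeps any sentence containing a digit so that quantities are never dropped, which is the one property a neutralizer for math problems cannot afford to get wrong. Passing the solver as a callback keeps the sketch independent of any particular model API.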
Watch whether models fine-tuned or trained explicitly on emotion-robust reasoning actually improve on this benchmark, or whether the problem persists even with targeted training.
