What happened
Researchers proved mathematically that Muon, a newer training algorithm, can store and learn more information per computational step than the standard method, stochastic gradient descent (SGD), especially on associative memory tasks that mimic how transformers recall facts. If that advantage holds up in real large-scale language model training, it could make AI training faster and cheaper, though the paper doesn't yet show that in practice.
Why it matters
Practitioners had noticed that Muon works better empirically, but nobody understood why or how large the advantage could be. This paper offers the first rigorous explanation of that advantage, which is a necessary step toward either adopting Muon broadly or finding even better optimizers.