New math shows why one optimizer trains language models faster than another

What happened

Researchers proved mathematically that Muon, a newer training algorithm, can store and learn more information per computational step than the standard method, stochastic gradient descent (SGD), particularly on associative memory tasks that mimic how transformers recall facts. If that advantage carries over to real large-scale language model training, it could make AI training faster and cheaper, though the paper does not yet demonstrate this in practice.
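To make the comparison concrete, here is a minimal sketch of how a single Muon-style update differs from a plain SGD update on one weight matrix. It is an illustration under stated assumptions, not the paper's setup: the function names and the toy gradient are hypothetical, momentum is left out, and the orthogonalization uses an exact SVD, whereas practical Muon implementations typically approximate it with a Newton-Schulz iteration applied to the momentum buffer.

```python
import numpy as np

def sgd_step(W, grad, lr):
    """Plain gradient descent: move along the raw gradient."""
    return W - lr * grad

def orthogonalize(M):
    """Set every singular value of M to 1 (the polar factor U @ Vt)."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def muon_style_step(W, grad, lr):
    """Muon-style step (momentum omitted): move along the orthogonalized gradient."""
    return W - lr * orthogonalize(grad)

# Hypothetical gradient with one dominant direction and one weak direction.
grad = np.diag([10.0, 0.1])
W = np.zeros((2, 2))

# Size of the change each rule applies to W, measured per direction.
sgd_update = W - sgd_step(W, grad, lr=1.0)
muon_update = W - muon_style_step(W, grad, lr=1.0)

sgd_singular_values = np.linalg.svd(sgd_update, compute_uv=False)
muon_singular_values = np.linalg.svd(muon_update, compute_uv=False)

print("SGD update singular values: ", sgd_singular_values)
print("Muon update singular values:", muon_singular_values)
```

In this toy case the SGD update is dominated by the strong direction, while the Muon-style update moves equally along both. That is one common intuition for why such an optimizer might write several associations into a matrix in a single step; whether it matches the paper's formal argument is not something this sketch establishes.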

Why it matters

For years, practitioners observed that Muon worked better empirically, but nobody understood why, or how large the advantage could be. This paper offers the first rigorous explanation of that advantage, a necessary step toward either adopting Muon broadly or finding even better optimizers.