The world is being quietly rearranged by people who write very long documents.


The title they went with: "The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression"

Noisy translates that to:

Neural networks compress their own learning process — and the compression matters more than the original skill


Researchers watching neural networks learn two sequence tasks found that just before the network abruptly starts working, something structural shifts: the dominant direction of weight change flips from being driven by gradients (the signal the network learns from) to being driven by weight decay (a penalty that pushes weights toward zero). Once this flip happens, that dominant direction becomes 4,000 times more important to the network's behavior than a random direction. Yet removing it doesn't break the learned skill; it only forces the network to re-encode the same information differently.
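To make the gradient-versus-weight-decay flip concrete, here is a minimal sketch. It is not the paper's measurement. It assumes plain SGD with L2 weight decay, where each step is delta_w = -lr * (grad + weight_decay * weights), and asks what share of the step along a chosen direction comes from the decay term. The function name `decay_share` and all the numbers are hypothetical, picked only for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def decay_share(weights, grad, direction, weight_decay):
    """Share of the update along `direction` that comes from weight decay.

    Assumption: plain SGD with L2 weight decay, so one step is
        delta_w = -lr * (grad + weight_decay * weights).
    The learning rate scales both terms equally and cancels out of the ratio.
    """
    norm = math.sqrt(dot(direction, direction))
    unit = [d / norm for d in direction]
    grad_part = abs(dot(grad, unit))                     # gradient-driven component
    decay_part = abs(weight_decay * dot(weights, unit))  # decay-driven component
    total = grad_part + decay_part
    return decay_part / total if total > 0 else 0.0


direction = [1.0, 0.0]  # hypothetical "main direction" of weight change

# Illustrative early-training state: small weights, large gradient.
early = decay_share([0.1, 0.2], [2.0, -1.0], direction, weight_decay=0.1)

# Illustrative late-training state: large weights, tiny gradient.
late = decay_share([5.0, 1.0], [0.01, 0.02], direction, weight_decay=0.1)

print(f"early decay share: {early:.3f}")
print(f"late decay share:  {late:.3f}")
```

In this toy setup the share sits near 0 while gradients dominate and near 1 once weight decay takes over. That crossover is the kind of flip the summary describes, though the paper may define and measure it differently.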
This is evidence that neural networks have a two-stage lifecycle that can be observed and measured. The network doesn't learn smoothly: it first learns in one way (guided by gradients), then abruptly reorganizes itself in another (guided by a compression penalty). The surprise is that the compression phase, which looks like it is losing information, turns out to be the critical part. If this pattern holds across more tasks and architectures, it would mean the inner workings of neural networks differ fundamentally from the usual picture: not 'learn a skill' but 'learn a skill, then compress it into a form you can actually use.'
Does this two-phase lifecycle show up in training tasks beyond these two sequence problems and across different network architectures, or is it specific to these tasks?
