The world is being quietly rearranged by people who write very long documents.


The title they went with Preservation Is Not Enough for Width Growth: Regime-Sensitive Selection of Dense LM Warm Starts Noisy translates that to

AI researchers still don't know which tricks actually work when scaling up smaller language models


Researchers tested different ways to reuse smaller AI models as starting points for bigger ones, and found no single best method — what works depends on whether you're optimizing for speed or accuracy. This matters because scaling up existing models is cheaper than training from scratch, but the field doesn't yet have reliable rules for which shortcuts to take.
For years, AI labs have assumed that the simplest approach to scaling — copying weights directly — would preserve progress. This paper shows that assumption breaks down depending on the specific scenario: exact copies work best in some situations, but structured changes win in others. The practical consequence is that teams building larger models from smaller checkpoints will need to test multiple approaches instead of relying on a single method, which adds time and computational cost to a process that's supposed to be cheaper than starting over.
The question is whether larger AI labs will adopt these multi-method testing approaches as standard practice, or continue betting on single heuristics and only learn this lesson when they hit scaling walls.

If you insist
Read the original →