What happened
Researchers found that when language models share the same parameter weights between input and output layers (a common cost-saving trick), those shared weights end up optimized for predicting the next word rather than understanding the input — making them worse at both jobs. This matters because it means a widespread design choice actually harms how well the model learns, especially for smaller models where every parameter counts.
Why it matters
This reveals a systematic flaw in how most language models are built: a cost-saving shortcut that was assumed to be neutral actually degrades performance by forcing a single set of weights to serve two incompatible purposes. For small language models, where parameter efficiency matters most, this hidden cost could justify using more parameters instead — changing how teams balance model size against training cost.