Language models waste parameters by optimizing embeddings for output, not input

What happened

Researchers found that when a language model ties its input and output embeddings, reusing one weight matrix both to read in tokens and to predict the next one (a common cost-saving trick), the shared matrix ends up shaped mainly by the output objective of predicting the next word rather than by the needs of the input side. According to the researchers, this leaves the shared weights worse at both jobs. The effect is most costly for smaller models, where every parameter counts.
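
To make the mechanism concrete, here is a minimal sketch of weight tying in plain Python. The toy matrix, its values, and the helper names (embed, logits, output_update) are illustrative assumptions, not taken from the research. The update rule is a simplified stand-in for an output-side gradient step; the only point is that one matrix serves both roles, so a change driven by the output loss also moves the input vectors.

```python
# Toy tied embedding: one matrix E (vocab_size x dim) serves both roles.
# All values and names are hypothetical, chosen only for illustration.
E = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]


def embed(token_id):
    """Input side: look up the row for a token."""
    return E[token_id]


def logits(hidden):
    """Output side: score every token by dot product with the same rows."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in E]


def output_update(token_id, hidden, lr=0.5):
    """Move a token's row toward `hidden`, a simplified stand-in for the
    output-side step that raises that token's score. Because E is shared,
    the token's input embedding changes as well."""
    E[token_id] = [w + lr * h for w, h in zip(E[token_id], hidden)]


before = list(embed(1))              # input vector for token 1 before the step
output_update(1, hidden=[2.0, 0.0])  # update motivated only by the output side
after = embed(1)                     # the input vector has moved too
```

With untied embeddings, the same step would change only the output matrix and leave the result of embed(1) untouched.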

Why it matters

This points to a systematic flaw in how many language models are built: a cost-saving shortcut that was assumed to be neutral appears to degrade performance by forcing a single set of weights to serve two incompatible purposes. For small language models, where parameter efficiency matters most, this hidden cost could justify spending extra parameters on separate input and output embeddings, which would change how teams balance model size against training cost.
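
As a rough sense of the trade-off, the following back-of-the-envelope sketch counts embedding parameters with and without tying. The vocabulary size, hidden dimension, and body parameter count are hypothetical numbers picked for illustration, not figures from the research.

```python
def embedding_params(vocab_size, dim, tied):
    """Parameters spent on the input and output embedding matrices."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * dim


# Hypothetical small model: these sizes are assumptions, not reported values.
vocab_size, dim = 50_000, 512
body_params = 25_000_000  # everything except the embeddings

tied = embedding_params(vocab_size, dim, tied=True)      # 25,600,000
untied = embedding_params(vocab_size, dim, tied=False)   # 51,200,000
extra = untied - tied                                    # cost of untying

# Share of the untied model's total parameters that untying adds.
extra_share = extra / (body_params + untied)
print(f"untying adds {extra:,} parameters ({extra_share:.1%} of the total)")
```

Under these assumed sizes, untying adds roughly a third of the model's total parameters, which is why the choice weighs most heavily on small models.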