The world is being quietly rearranged by people who write very long documents.


The title they went with: Weight Tying Biases Token Embeddings Towards the Output Space

Noisy translates that to:

Language models waste parameters by optimizing embeddings for output, not input


Researchers found that when language models share the same weights between the input and output layers (a common cost-saving trick), those shared weights end up optimized for predicting the next word rather than for understanding the input, making them worse at both jobs. This matters because it means a widespread design choice actually harms how well the model learns, especially for smaller models where every parameter counts.
This reveals a systematic flaw in how most language models are built: a cost-saving shortcut that was assumed to be neutral actually degrades performance by forcing a single set of weights to serve two incompatible purposes. For small language models, where parameter efficiency matters most, this hidden cost could justify using more parameters instead — changing how teams balance model size against training cost.
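
To put a rough number on that trade-off, here is a minimal sketch of how many parameters untying would add back. The vocabulary size and hidden dimension are hypothetical values picked for illustration, not figures from the paper, and the helper `embedding_params` is our own name, not part of any library.

```python
def embedding_params(vocab_size: int, hidden_dim: int, tied: bool) -> int:
    """Count parameters in the input embedding plus the output projection (no bias)."""
    input_embedding = vocab_size * hidden_dim
    # With weight tying, the output projection reuses the input embedding matrix,
    # so it contributes no additional parameters.
    output_projection = 0 if tied else vocab_size * hidden_dim
    return input_embedding + output_projection


# Hypothetical sizes, chosen for illustration only.
VOCAB_SIZE = 32_000
HIDDEN_DIM = 512

tied = embedding_params(VOCAB_SIZE, HIDDEN_DIM, tied=True)
untied = embedding_params(VOCAB_SIZE, HIDDEN_DIM, tied=False)
extra = untied - tied

print(f"tied: {tied:,}  untied: {untied:,}  extra from untying: {extra:,}")
```

Under these assumed sizes, untying doubles the embedding-related parameter count. Whether that extra cost pays for itself is the question the paper raises; this arithmetic only shows what is being traded.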
