The world is being quietly rearranged by people who write very long documents.


The title they went with
TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling

Noisy translates that to

Researchers shrink language models to 1.5-bit weights without losing capability


Computer scientists have figured out how to train small language models using extremely compressed numerical representations (ternary weights: -1, 0, or +1) from the start, rather than compressing already-trained models afterward. This cuts memory use by 2.4x and makes it practical to run useful language models on phones, laptops, and other devices that normally can't handle them.
The bottleneck for deploying language models on edge devices has been memory: most useful models are too big to fit. Training models that are natively compact, rather than compressing them after the fact, removes a major cost barrier to running AI inference on consumer hardware. There is a side benefit too: the discrete weights act as a regularizer, which helps prevent overfitting on small training datasets.
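
To make the idea of ternary weights concrete, here is a minimal sketch of how one layer's weights could be reduced to -1, 0, or +1 plus a single per-layer scale, and then stored compactly. It is an illustration, not the paper's method: the mean-absolute-value scaling rule, the base-3 packing scheme, and the names `ternarize`, `pack5`, and `unpack5` are all assumptions made for this example, and it quantizes a fixed list of numbers rather than training with ternary weights from the start. A three-valued weight carries log2(3), roughly 1.58 bits, of information; packing five of them into one byte works because 3^5 = 243 fits in 256, which comes to 1.6 bits per weight.

```python
import math


def ternarize(weights, eps=1e-8):
    """Map float weights to {-1, 0, +1} plus one float scale for the layer.

    Assumption: the scale is the mean absolute weight. The paper's
    adaptive layer-wise scaling may work differently.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, clamp to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def pack5(trits):
    """Pack five ternary values into one byte as a base-3 number."""
    assert len(trits) == 5
    byte = 0
    for t in reversed(trits):
        byte = byte * 3 + (t + 1)  # shift {-1, 0, +1} to digits {0, 1, 2}
    return byte


def unpack5(byte):
    """Inverse of pack5: recover five ternary values from one byte."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3 - 1)
        byte //= 3
    return trits


# Hypothetical weights for a tiny five-parameter "layer".
layer = [0.42, -0.03, -0.57, 0.08, 0.31]

ternary, scale = ternarize(layer)          # ternary == [1, 0, -1, 0, 1]
approx = [t * scale for t in ternary]      # dequantized approximation
packed = pack5(ternary)                    # one byte holds all five weights

print(ternary, round(scale, 3), packed)
print("bits per weight, ideal:", round(math.log2(3), 3), "packed:", 8 / 5)
```

Small weights collapse to 0 and large ones to plus or minus the layer scale, so what has to be stored per layer is one float and a stream of packed bytes.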
