The world is being quietly rearranged by people who write very long documents.


The title they went with: Faster Superword Tokenization. Noisy translates that to:

Tokenization algorithm now trains 600 times faster, opening the door to better language model vocabulary


Researchers figured out how to train a tokenization algorithm (the method that breaks text into chunks for AI models) vastly faster by changing how the algorithm counts and combines chunks of text. Instead of keeping entire documents in memory, the new approach counts chunk combinations the same way it counts individual chunks, which means the whole training process dropped from nearly 5 days to 10 minutes on the same amount of data.
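For readers who want to see the counting idea in code, here is a minimal sketch of it, not the paper's implementation. It assumes "chunks" are whitespace-delimited pretokens found by a simple regex, and that a "chunk combination" is a run of adjacent chunks. The names `count_chunks`, `count_chunk_combinations`, and `weighted_pair_counts` are hypothetical. The point is only that each document is read once and thrown away, leaving a frequency table behind.

```python
from collections import Counter
import re

# Assumed pretokenizer: an optional leading space plus a word or a punctuation run.
PRETOKEN = re.compile(r"\s?\w+|\s?[^\w\s]+")


def count_chunks(docs):
    """Stream documents and keep only a frequency table of individual chunks."""
    counts = Counter()
    for doc in docs:
        counts.update(PRETOKEN.findall(doc))
    return counts


def count_chunk_combinations(docs, width=2):
    """Count runs of `width` adjacent chunks the same way: one table, no stored documents."""
    counts = Counter()
    for doc in docs:
        chunks = PRETOKEN.findall(doc)
        for i in range(len(chunks) - width + 1):
            counts[tuple(chunks[i:i + width])] += 1
    return counts


def weighted_pair_counts(chunk_counts):
    """Count adjacent character pairs, weighting each distinct chunk by its frequency.

    This is the usual BPE trick: work is proportional to the number of
    distinct chunks, not to the length of the corpus.
    """
    pairs = Counter()
    for chunk, freq in chunk_counts.items():
        for left, right in zip(chunk, chunk[1:]):
            pairs[(left, right)] += freq
    return pairs


docs = ["the cat sat", "the cat ran"]
chunk_table = count_chunks(docs)
combo_table = count_chunk_combinations(docs)
pair_table = weighted_pair_counts(chunk_table)
```

Once both tables exist, the merge loop only ever touches them, so memory scales with the number of distinct entries rather than with the size of the corpus.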
Tokenization is a bottleneck that nobody notices until they hit it. Language models need to chop text into workable pieces before they learn anything, and the algorithm that does this chopping (byte pair encoding, or BPE) has been stuck with the same slow training process for years. This speedup doesn't change what the model learns, but it changes what's feasible to experiment with. Faster training means researchers can test whether different tokenization strategies actually improve model performance, instead of just using whatever is fast enough to train overnight. The practical effect: the algorithms that were theoretically better but too slow to bother with (BoundlessBPE and SuperBPE) become practical to use.
Watch whether new language model training runs begin using these faster tokenization methods instead of the standard byte pair encoding, and whether that produces measurably better model performance per dollar of compute spent.
