The world is being quietly rearranged by people who write very long documents.


The title they went with:
InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs

Noisy translates that to:

AI researchers find a better way to compress images for multimodal AI — could mean faster image understanding and generation in a single model


Researchers developed a method to decide which information from images matters most when compressing them into tokens — the tiny units AI models use to process visual data. The method uses information theory to preserve what's useful for both understanding and generating images, while throwing away redundant noise.
Today's multimodal AI models (the ones that can both understand and generate images) have to squeeze images into a tight token budget. The question is what gets thrown away. This work's answer: throw away noise and redundancy, keep structure. That is a more principled answer than the architecture-driven choices in use today. The practical effect is that when you combine image understanding and generation in one model, you lose less useful information in the compression step. That means better performance at both tasks without any additional training data.
It remains to be seen whether major ML labs will adopt this tokenization method in their next-generation multimodal models, and whether it measurably improves image understanding accuracy or generation quality compared with existing approaches.
