The world is being quietly rearranged by people who write very long documents.


The title they went with: Token-Efficient Multimodal Reasoning via Image Prompt Packaging

Noisy translates that to:

AI labs find a way to cut the cost of running image-based AI by up to 91 percent


Researchers tested a technique that embeds text information directly into images, drastically reducing the number of text tokens (the units of text that models process and providers bill for) needed to run multimodal AI models like GPT-4. The approach cuts costs by 35 to 91 percent depending on the task, though accuracy sometimes drops and the effect varies wildly across different models and problem types.
Token costs are the main reason large AI models stay expensive to run at scale. If you can cut token overhead without losing accuracy on most tasks, inference gets cheaper, which means higher margins for cloud providers, lower prices for users, or both. The catch: this works reliably only on certain tasks (structured data like SQL queries), and different models respond completely differently to the same visual encoding trick, so you can't apply it universally and expect it to work.
Watch whether cloud providers (OpenAI, Anthropic, Google) actually adopt image prompt packaging in production and what they charge for it — whether the cost savings get passed to users or get absorbed as margin.
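
For a rough feel of the arithmetic, here is a minimal Python sketch of how a savings percentage like the ones above could be computed. The function name, its parameters, and every number in it are hypothetical illustrations, not figures or methods from the paper; it assumes only that an image is billed as some count of tokens that replaces most of the text prompt.

```python
def packaging_savings(text_tokens, image_tokens, residual_text_tokens=0):
    """Fraction of input tokens saved by packing prompt text into an image.

    text_tokens: tokens in the plain-text prompt (the baseline)
    image_tokens: tokens billed for the rendered image (hypothetical value)
    residual_text_tokens: text still sent alongside the image
    A negative result means the packed prompt costs more than the baseline.
    """
    if text_tokens <= 0:
        raise ValueError("text_tokens must be positive")
    packed = image_tokens + residual_text_tokens
    return 1 - packed / text_tokens


# Hypothetical long prompt: 2000 text tokens replaced by a 700-token image
# plus 100 tokens of remaining instructions.
long_prompt = packaging_savings(2000, 700, residual_text_tokens=100)
print(f"long prompt saving: {long_prompt:.0%}")

# Hypothetical short prompt: the image costs more than the text it replaces.
short_prompt = packaging_savings(300, 700)
print(f"short prompt saving: {short_prompt:.0%}")
```

Under these made-up numbers the long prompt comes out cheaper and the short one more expensive, which is one simple way the benefit could swing with the task. It says nothing about accuracy, which is the other half of the trade-off.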
