The world is being quietly rearranged by people who write very long documents.


The title they went with:
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Noisy translates that to:

Making long-text AI inference up to 10x faster by doing less work on unimportant tokens


A new method for running large language models on long documents cuts the computational work in half or more by deciding which words actually matter before doing the expensive processing on them. That lets AI services handle longer documents faster and at lower cost, and makes it practical to run these models on less expensive hardware.
Long-context LLM inference has been expensive because the attention mechanism (the part that decides which earlier words matter for each new word) scales with the square of document length: double the context, quadruple the work. This proposal cuts that cost by filtering out unimportant tokens early, which is what it would take for running models on long documents to become routine rather than expensive. The reported speedups range from 1.2x to 10x across the tested models and context lengths. What changes in practice is whether a company can afford to add a long-document feature without redesigning its infrastructure.
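For readers who want to see the "filter first, then attend" idea in miniature, here is a rough sketch in plain Python. It is not the AsyncTLS method, whose details are in the paper; it is a generic top-k illustration, and the function names (`attend`, `topk_sparse_attend`) and the parameter `k` are made up for this example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dense attention for one new token: weighs every cached token."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def topk_sparse_attend(query, keys, values, k):
    """Hypothetical helper: keep the k highest-scoring tokens, attend to those only."""
    scores = [dot(query, key) for key in keys]
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()  # restore original token order
    output = attend(query, [keys[i] for i in keep], [values[i] for i in keep])
    return output, keep

# Toy data: four cached tokens, two of which line up with the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
output, kept = topk_sparse_attend(query, keys, values, k=2)
print(kept)  # [0, 2]
```

Note that this toy version still scores every token in order to pick the top k, so it saves nothing on the selection step itself. A practical system needs a cheaper way to shortlist candidates, and that is where the real engineering lies.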
The open question is whether these speedups hold up in production, on real-world applications with messy inputs rather than benchmark tests, and whether the accuracy loss on important-but-buried information stays acceptable as context lengths approach what users actually request.
