The world is being quietly rearranged by people who write very long documents.


The title they went with: DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

Noisy translates that to:

AI can now reject fewer correct answers and still produce text of similar quality, trading strict exactness for speed


Researchers found a way to make large language models run faster by loosening the rules about which words are accepted at each step. In speculative decoding, a small, fast model drafts words and a larger reference model checks them. Instead of requiring every drafted word to pass a check that reproduces the reference model exactly, the new approach accepts a slightly wider range of plausible words, as long as the final text still makes sense. In practice, the paper reports that the same AI can generate answers 40-50% faster without noticeably worse quality.
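To make the "loosened check" concrete, here is a minimal sketch. The strict rule (`weight=0.0`) is the standard speculative-sampling test: keep a drafted word with probability min(1, p/q), where p is the reference model's probability for that word and q is the draft model's. The `weight` knob that blends the draft's own probability into the verifier is a hypothetical relaxation made up for illustration; it is not taken from the paper, whose actual verification scheme may differ. The sketch also leaves out the resampling step that standard speculative sampling performs after a rejection.

```python
import random

def accept_prob(p_target, q_draft, weight=0.0):
    """Probability of keeping one drafted word (q_draft must be > 0).

    weight=0.0 -> standard strict rule: min(1, p_target / q_draft).
    weight>0.0 -> hypothetical relaxed rule (an assumption, not the paper's):
                  the verifier is a blend of reference and draft probabilities.
    """
    verifier = (1.0 - weight) * p_target + weight * q_draft
    return min(1.0, verifier / q_draft)

def count_accepted(drafted, weight=0.0, seed=0):
    """Count how many drafted words survive before the first rejection.

    drafted: list of (p_target, q_draft) pairs, one per proposed word, in order.
    """
    rng = random.Random(seed)
    kept = 0
    for p_target, q_draft in drafted:
        if rng.random() >= accept_prob(p_target, q_draft, weight):
            break  # first rejection ends the drafted run
        kept += 1
    return kept

# A word the draft likes (q=0.4) more than the reference does (p=0.2):
strict = accept_prob(0.2, 0.4)               # strict check keeps it half the time
relaxed = accept_prob(0.2, 0.4, weight=0.5)  # blended check keeps it more often
```

The larger the blend weight, the more drafted words survive each round and the fewer slow reference-model passes are needed. The price is that the output no longer follows the reference model's distribution exactly.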
The bottleneck in AI inference has shifted. For the past few years, the constraint was compute: you needed more chips. This paper argues that part of the constraint is now verification logic: the checking step that ensures quality is throwing out correct answers just to be safe. Loosening that check doesn't noticeably hurt quality, which means cheaper, faster inference without new hardware. This matters because inference cost largely determines whether an AI tool gets deployed at scale or stays in the lab. Whatever fraction of inference time you cut, operating cost falls by roughly the same fraction.
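A quick back-of-the-envelope check on what "faster" buys, under two assumptions that are not from the paper: that "N% faster" means N% more text generated per second, and that operating cost scales linearly with compute time.

```python
def time_fraction(speedup_pct):
    """Fraction of the original time needed after an N% throughput gain.

    Assumes "N% faster" means throughput multiplied by (1 + N/100).
    """
    return 1.0 / (1.0 + speedup_pct / 100.0)

def cost_saving_pct(speedup_pct):
    """Percent of time (and, assuming linear cost, money) saved."""
    return 100.0 * (1.0 - time_fraction(speedup_pct))

low = cost_saving_pct(40)    # about 28.6% saved
high = cost_saving_pct(50)   # about 33.3% saved
half = cost_saving_pct(100)  # halving the cost takes a 100% throughput gain
```

On that reading, a 40-50% speedup trims the bill by a bit under a third rather than by half.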
Watch whether major AI deployment systems (cloud providers, open-source inference frameworks) adopt this relaxed verification approach within the next 12 months, and whether inference cost per token drops measurably as a result.
