The world is being quietly rearranged by people who write very long documents.


The title they went with

Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference

Noisy translates that to

Faster responses for AI systems handling mixed text and video


A new scheduling system makes AI services respond faster when users send a mix of request types: videos, images, and text. Right now, one large video request can bog down the entire system. The new scheduler lets smaller text requests go through quickly while bigger video requests process in the background, cutting wait times for interactive requests by 78%.
As AI services handle richer inputs (video, not just text), the bottleneck has shifted from compute to scheduling — the order in which requests get processed. This work shows the bottleneck is solvable with software alone, meaning companies can deploy multimodal AI without buying expensive new hardware, and users get usable latency instead of frustrating delays.
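
To see why the order alone matters, here is a toy sketch in Python. It is not the paper's scheduler: the request names, the processing costs, and the simple "smallest first" rule are all made-up assumptions, chosen only to show how one big video at the front of a first-come-first-served queue holds up everything behind it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    modality: str  # "text", "image", or "video"
    cost: float    # hypothetical processing time in seconds


def fifo_order(requests):
    """First come, first served: keep arrival order."""
    return list(requests)


def small_first_order(requests):
    """Illustrative rule only: cheapest request first.
    sorted() is stable, so ties keep their arrival order."""
    return sorted(requests, key=lambda r: r.cost)


def wait_times(order):
    """Map each request name to the time it waits before processing starts,
    assuming one request is processed at a time."""
    waits, clock = {}, 0.0
    for r in order:
        waits[r.name] = clock
        clock += r.cost
    return waits


# Made-up workload: a large video arrives just ahead of small requests.
requests = [
    Request("video-1", "video", 30.0),
    Request("text-1", "text", 0.5),
    Request("image-1", "image", 3.0),
    Request("text-2", "text", 0.5),
]

fifo = wait_times(fifo_order(requests))
small_first = wait_times(small_first_order(requests))

print("FIFO wait for text-1:       ", fifo["text-1"])         # 30.0
print("Small-first wait for text-1:", small_first["text-1"])  # 0.0
print("Small-first wait for video: ", small_first["video-1"]) # 4.0
```

In this toy workload the text request stops waiting 30 seconds behind the video, and the video only waits 4 seconds longer than before. A pure smallest-first rule can starve large requests if small ones keep arriving, so a real scheduler would need some guard against that; the sketch leaves it out.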

If you insist
Read the original →