The world is being quietly rearranged by people who write very long documents.


The title they went with: "Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference". Noisy translates that to:

AI models can now run up to 50% faster on existing GPUs for real-time ads


A new software technique makes large AI models run much faster on common graphics cards. This means companies can use AI for things like real-time online advertising without delays, serving more requests with the same hardware.
Real-time applications of large AI models, such as online advertising, have been limited by how fast graphics cards can process requests. Each small calculation on the card adds a tiny delay, and those delays add up. The new technique cuts them by half in some cases, so companies can deploy more responsive AI services or handle many more users on their current hardware.
Watch for this optimization to be integrated into more widely used AI inference software libraries, and for other industries to adopt it for their real-time AI applications.
