The world is being quietly rearranged by people who write very long documents.


The title they went with
AEGIS: Scaling Long-Sequence Homomorphic Encrypted Transformer Inference via Hybrid Parallelism on Multi-GPU Systems

Noisy translates that to

Researchers cut communication overhead in encrypted AI by up to 81%, making private inference on GPUs practical


A team figured out how to run AI models on encrypted data across multiple GPUs without drowning in data transfers between them. Previously, encrypted AI was so communication-heavy that it was basically unusable at scale; the new approach reduces that overhead by 57–81% depending on the task, making it fast enough that four GPUs can actually work together efficiently.
Fully homomorphic encryption (the math that lets you run computations on data without ever decrypting it) has been theoretically perfect for privacy for years but practically unusable — the overhead was so massive that even small models on multiple GPUs would spend most of their time shuffling encrypted data around instead of computing. This paper shows how to coordinate what gets sent between GPUs by looking at both the AI model's structure and the encryption's mathematical dependencies at the same time, instead of treating them as separate problems. That matters because cloud services, hospitals, and financial companies that want to use AI on sensitive data without actually seeing it now have a clearer path to doing it at reasonable speed.
Watch whether commercial cloud providers (AWS, Google Cloud, Azure) begin offering encrypted inference as a service within 18 months, or whether the speedup remains a laboratory result that doesn't translate to actual products.
