The world is being quietly rearranged by people who write very long documents.


The title they went with: "Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction"

Noisy translates that to:

AI now generates speech and gestures together instead of separately


Researchers built a system that synthesizes human speech and hand gestures simultaneously from text, rather than creating them as separate outputs. This matters because real human communication has speech and gestures tightly synchronized — when they're made independently, they fall out of sync and look unnatural.
This is an incremental improvement in video synthesis and animation technology, but it doesn't cross a threshold in cost, deployment, or capability that would affect non-researchers. The system works in a lab on research benchmarks, not in production systems that real people use.
