The world is being quietly rearranged by people who write very long documents.


The title they went with

EgoSim: Egocentric World Simulator for Embodied Interaction Generation

Noisy translates that to

Researchers build a simulator that lets AI watch and learn from first-person video, then predict what happens next


A team created software that watches egocentric video — the kind shot from a person's perspective — and learns to simulate what the scene will look like as the person moves and interacts with objects. Previous simulators either drifted visually when the viewpoint changed, or froze the scene without updating it as objects moved; this one maintains spatial consistency while tracking how the world actually changes.
The bottleneck was always data: getting thousands of videos with perfectly labeled camera positions, 3D object locations, and action annotations is expensive and slow. This paper describes a pipeline to extract that training data automatically from raw, unlabeled smartphone video shot in the wild — which means the data source becomes nearly unlimited. If a robot can learn to predict what its own actions will produce in the real world by training on human first-person video, it doesn't need specially recorded robot training footage.
Watch whether robotics labs actually use this to train manipulation skills without collecting their own video datasets, or whether the visual quality and spatial accuracy still fail when robots try to execute predicted actions in the real world.

If you insist
Read the original →