Robot training hits a wall: better vision doesn't help if actions are chopped into discrete tokens

What happened

Researchers found that upgrading the vision system in a robot model stops improving control once actions are encoded as discrete tokens: the action tokens, not perception, become the bottleneck. This means you can't blindly scale robot perception. You have to widen the action bottleneck first, or better sensors and encoders won't translate into better robot control.

Why it matters

For years, roboticists assumed robot models scaled the same way language and vision models do: throw a better sensor or encoder at the problem, get better results downstream. This paper shows that assumption breaks when you discretize actions into a fixed vocabulary. The practical consequence is direct: if your robot's action system uses a fixed codebook, money spent on a better camera or vision encoder buys little until you also expand the action codebook. This reveals a hidden cost structure in physical AI: you can't optimize the pieces independently. You have to identify where in the pipeline information is actually being lost.
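The bottleneck argument can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the paper's setup: a one-dimensional action in [-1, 1], Gaussian noise standing in for perception quality, and a uniform codebook of `num_bins` values standing in for the discrete action vocabulary. All function names here are hypothetical.

```python
import math
import random


def quantize(action, num_bins):
    """Snap an action in [-1, 1] to the center of one of num_bins uniform bins."""
    width = 2.0 / num_bins
    index = min(int((action + 1.0) / width), num_bins - 1)
    return -1.0 + (index + 0.5) * width


def control_rmse(perception_noise, num_bins, trials=20000, seed=0):
    """RMSE between the ideal action and the executed (quantized) action.

    perception_noise is a stand-in for vision quality: lower means
    a better sensor or encoder (toy assumption).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        target = rng.uniform(-1.0, 1.0)                # ideal action
        estimate = target + rng.gauss(0.0, perception_noise)
        estimate = max(-1.0, min(1.0, estimate))       # keep in valid range
        executed = quantize(estimate, num_bins)        # discrete action token
        total += (executed - target) ** 2
    return math.sqrt(total / trials)


def quantization_floor(num_bins):
    """RMSE of uniform quantization alone: bin width / sqrt(12)."""
    return (2.0 / num_bins) / math.sqrt(12.0)


if __name__ == "__main__":
    for num_bins in (8, 64):
        print(f"codebook size {num_bins}, floor {quantization_floor(num_bins):.4f}")
        for noise in (0.2, 0.05, 0.0125, 0.0):
            rmse = control_rmse(noise, num_bins)
            print(f"  perception noise {noise:<7} control RMSE {rmse:.4f}")
```

In this toy model, reducing perception noise helps only until the error reaches the floor set by the codebook size. Past that point, the remaining error comes from the bins themselves, and only a larger codebook (or a continuous action output) lowers it.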

The signal

Watch whether robotics labs that currently use discrete action tokens start switching to continuous representations (like Diffusion Policy), or instead start investing in larger action codebooks alongside better vision systems. Either move would suggest this bottleneck insight is becoming practice, not just theory.