AI learns to click where you point using screen attention maps

What happened

Researchers built a smaller AI model that locates and clicks on screen elements by reading its own attention over the screenshot, rather than generating exact pixel coordinates. The approach needs far less training data, about 100,000 screenshots instead of millions, and could make AI assistants that automate computer tasks more practical to build and deploy.
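
As a rough illustration of the general idea, and not the researchers' actual method, the sketch below turns a grid of attention weights over screenshot patches into a click point by picking the centre of the most-attended patch. The function name `attention_to_click`, the grid layout, and the toy values are all assumptions made for this example.

```python
from typing import List, Tuple


def attention_to_click(
    attn: List[List[float]],
    screen_width: int,
    screen_height: int,
) -> Tuple[int, int]:
    """Map a patch-level attention grid to a pixel click point.

    `attn` is a hypothetical rows x cols grid of non-negative attention
    weights, one per screenshot patch. The click point is the centre of
    the patch with the highest weight.
    """
    rows, cols = len(attn), len(attn[0])

    # Find the (row, col) of the patch with the largest attention weight.
    best_row, best_col = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: attn[rc[0]][rc[1]],
    )

    # Convert patch indices to pixel coordinates at the patch centre.
    patch_w = screen_width / cols
    patch_h = screen_height / rows
    x = int((best_col + 0.5) * patch_w)
    y = int((best_row + 0.5) * patch_h)
    return x, y


# Toy 3x4 attention grid over a 1280x720 screenshot (made-up values).
toy_attn = [
    [0.01, 0.02, 0.01, 0.01],
    [0.02, 0.05, 0.70, 0.03],
    [0.01, 0.04, 0.08, 0.02],
]
click = attention_to_click(toy_attn, 1280, 720)
print(click)  # (800, 360)
```

Taking the single strongest patch is the simplest possible choice; a real system would more likely aggregate attention across heads and layers or use a weighted average, and nothing here should be read as the model's actual decoding step.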

Why it matters

For the first time, researchers showed that general-purpose multimodal AI models already carry latent spatial understanding in their attention patterns, and that modest fine-tuning can unlock it without building a coordinate-prediction system from scratch. This is a structural insight: the bottleneck for GUI automation may not be better architectures but better ways to extract what models already know.