The world is being quietly rearranged by people who write very long documents.


The title they went with: Can LLMs Reason About Attention? Towards Zero-Shot Analysis of Multimodal Classroom Behavior. Noisy translates that to:

Classroom surveillance software claims to read student attention without storing video


Researchers built a system that analyzes student behavior in real time by extracting skeletal poses and eye gaze from classroom video, then immediately deleting the footage and feeding only the geometric data to a large language model for interpretation. In practice, this means schools could get automated, attendance-style reports on who is paying attention without keeping recordings, though the system still struggles with spatial reasoning, such as making sense of the classroom layout.
This is a real-world deployment test of whether large language models can actually do useful work on multimodal data outside of controlled research settings. The honest finding is that they mostly can't yet: the system works for basic pose extraction but fails at spatial reasoning, which is the part that would actually matter to educators. What's worth watching is whether this shapes how schools think about classroom surveillance. If the technology barely works, the privacy compliance might matter more than the utility, which inverts the usual playbook where utility justifies the privacy trade.
The open question is whether any school district actually deploys this and, if one does, whether teachers find the attention summaries useful enough to change anything about how they teach.
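
For the curious, here is a minimal sketch of what "feeding only the geometric data" to a language model could look like once the frame itself is gone. Everything in it is an assumption made for illustration: the `StudentGeometry` record, its fields, the coordinate conventions, and the prompt wording are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class StudentGeometry:
    """Hypothetical per-student record kept after the video frame is discarded."""
    student_id: str
    head_xy: tuple[float, float]   # assumed: normalized image coordinates in 0..1
    gaze_dir: tuple[float, float]  # assumed: 2D gaze direction in image space
    wrist_raised: bool             # assumed: derived from the skeletal pose


def describe(s: StudentGeometry) -> str:
    """Serialize one student's geometry as a single line of plain text."""
    return (
        f"{s.student_id}: head=({s.head_xy[0]:.2f}, {s.head_xy[1]:.2f}), "
        f"gaze=({s.gaze_dir[0]:+.2f}, {s.gaze_dir[1]:+.2f}), "
        f"hand_raised={'yes' if s.wrist_raised else 'no'}"
    )


def build_prompt(students: list[StudentGeometry]) -> str:
    """Assemble a text-only prompt; no pixels are included at any point."""
    lines = [describe(s) for s in students]
    return (
        "Each line gives one student's pose and gaze geometry from a single frame.\n"
        "Say which students appear to be attending to the front of the room.\n\n"
        + "\n".join(lines)
    )


students = [
    StudentGeometry("s1", (0.25, 0.60), (0.0, -1.0), False),
    StudentGeometry("s2", (0.70, 0.55), (0.9, 0.1), True),
]
prompt = build_prompt(students)
print(prompt)
```

The point of the sketch is the shape of the trade: the model only ever sees a few numbers per student, which is good for privacy but also means it has to infer the room layout from coordinates alone, which is plausibly where the spatial reasoning trouble comes from.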

If you insist
Read the original →