The world is being quietly rearranged by people who write very long documents.


The title they went with CoLoRSMamba: Conditional LoRA-Steered Mamba for Supervised Multimodal Violence Detection Noisy translates that to

Violence detection AI now works better with sound — if the video and audio actually match


Researchers built a video-analysis system that uses audio to detect violence more accurately, but only when the sound relates to what's happening on screen. The system learned to ignore noisy or irrelevant audio, achieving 88% accuracy on security camera footage where audio and video aligned.
Most violence-detection systems either ignore audio entirely or treat it as a separate input stream. This one makes the audio adapt based on what the video sees — the system learns which sounds matter for the specific scene. In practice, this means security systems can get more reliable alerts without needing to engineer which audio signals matter in advance. But the catch is brutal: the system only works when audio and video are meaningfully related. In messy real-world deployments — noisy streets, crowded spaces, poor synchronization — this advantage evaporates.
The real test is whether this accuracy holds on deployment footage from actual surveillance systems with naturally noisy, desynchronized, or weakly-related audio-video pairs, not curated datasets.

If you insist
Read the original →