The world is being quietly rearranged by people who write very long documents.


The title they went with Do Audio-Visual Large Language Models Really See and Hear? Noisy translates that to

Audio-visual AI models ignore what they hear when it conflicts with what they see


Researchers analyzed how multimodal AI models process sound and video together and found that audio information gets encoded internally but systematically suppressed when generating text output. In practice, this means an audio-visual AI trained to understand both senses will favor visual information and ignore contradictory audio cues, behaving more like its vision-language ancestor than a genuinely integrated multimodal system.
Most multimodal AI development assumes that combining different senses should produce better understanding. This paper shows the opposite is happening at the architectural level: the model learns to encode rich audio information early on, but the fusion layers actively suppress it in favor of visual dominance. The structural bias is baked into training, not an accident. This suggests that current approaches to multimodal AI may be bottlenecked by the models they're built from, not by the data or the task. If audio-visual models are to actually listen, the problem isn't adding more audio data — it's redesigning how models arbitrate between senses.
Check whether the next generation of audio-visual models show the same visual dominance pattern, or whether developers explicitly balance modality weighting during training to fix the bias.

If you insist
Read the original →