The world is being quietly rearranged by people who write very long documents.


The title they went with Multilingual Language Models Encode Script Over Linguistic Structure Noisy translates that to

Multilingual AI models organize around how languages look, not how they work


Researchers analyzed how multilingual language models store representations of different languages and found they organize primarily by writing system (script and spelling) rather than by deeper linguistic structure. This means an AI trained on multiple languages treats romanized Chinese as closer to English than to native-script Chinese, even though the underlying language is identical.
This reveals a blind spot in how current multilingual AI systems actually work. The models engineers assume are learning abstract linguistic patterns are instead latching onto surface-level orthographic cues. This matters because it suggests multilingual models may fail systematically on tasks where writing system differs from meaning—like code-switching between scripts, or processing the same language written multiple ways. It also hints that claims about AI developing a unified understanding across languages are premature: what looks like linguistic abstraction may just be script recognition.
Test whether multilingual models fail predictably when the same language is presented in different scripts, or when scripts switch mid-sentence—an observable failure mode that would confirm whether this orthographic bias creates real performance gaps in production systems.

If you insist
Read the original →