The world is being quietly rearranged by people who write very long documents.


The title they went with: Lexical Tone is Hard to Quantize: Probing Discrete Speech Units in Mandarin and Yorùbá. Noisy translates that to:

Speech AI struggles to encode tone and rhythm — and that matters for languages that rely on them


Researchers found that when speech systems convert continuous audio into discrete units (the standard way modern AI processes sound), they systematically lose tonal information: the pitch patterns that change word meaning in languages like Mandarin and Yorùbá. Speech AI built on these discrete units will therefore misrecognize or drop tonal distinctions in these languages unless the underlying technology changes.
Most modern speech AI is built on a shortcut: convert sound into discrete chunks, then process those chunks. This works fine for English, where tone doesn't change word meaning. But in tone languages, losing tone information is like losing vowels. The paper shows the problem isn't unsolvable: the raw audio contains tone information, and the researchers found a two-stage clustering method that preserves it better, but current systems don't use that method. As a result, deployed speech systems (voice assistants, transcription, text-to-speech) built for tone languages are likely worse at their job than the underlying audio data would allow, and fixing that requires deliberate design choices that haven't been standardized.
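To make the idea concrete, here is a toy sketch in Python. It is not the paper's method or data: the two-number "frames", the small k-means helper, and the choice of residual clustering for the second stage are all assumptions made for illustration, since this summary does not spell out how the researchers' two-stage method works. Each synthetic frame has a large "phone" feature and a small "tone" feature. One round of clustering spends its units on the phone difference; a second round, run on what is left over after subtracting the first-stage centroid, picks up tone.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every seed chosen so far.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute the means.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels

def tone_purity(units, tones):
    """Fraction of frames whose tone matches the majority tone of their unit."""
    correct = 0
    for u in set(units):
        member_tones = [t for x, t in zip(units, tones) if x == u]
        correct += max(member_tones.count(t) for t in set(member_tones))
    return correct / len(tones)

# Hypothetical toy data: feature 0 separates "phones" by a wide margin,
# feature 1 separates "tones" by a much narrower one.
rng = random.Random(0)
frames, tones = [], []
for phone in (0, 1):
    for tone in (0, 1):
        for _ in range(50):
            frames.append((10.0 * phone + rng.uniform(-0.1, 0.1),
                           1.0 * tone + rng.uniform(-0.1, 0.1)))
            tones.append(tone)

# Stage 1: ordinary quantization. The two units follow the phone split.
cents1, units1 = kmeans(frames, k=2)

# Stage 2 (assumed residual scheme): cluster what stage 1 failed to explain.
residuals = [tuple(x - c for x, c in zip(f, cents1[u])) for f, u in zip(frames, units1)]
_, units2 = kmeans(residuals, k=2)

single_stage = tone_purity(units1, tones)
two_stage = tone_purity(list(zip(units1, units2)), tones)
print(single_stage, two_stage)
```

On this toy data the single-stage units predict tone no better than chance (0.5), while the pair of first-stage and second-stage units recovers it completely (1.0). Real speech features are far messier, so treat this only as an intuition for why a second clustering pass can keep information the first pass throws away.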
Worth watching: whether speech AI benchmarks for tone languages start including explicit tone-preservation metrics, and whether systems trained with tone-aware quantization methods begin matching or exceeding the performance of existing systems on tone-dependent tasks.
