Language models pick the most compressible answer, not the true one — unless falsehood is obviously incoherent
What happened
Researchers trained small language models on contradictory data (math problems paired with both correct and incorrect answers) and found that the models have no built-in preference for truth: they favor whichever set of answers compresses more cleanly into their learned patterns. When the errors follow a coherent alternative rule system, models pick the false system about as often as the true one, that is, at chance rates; when the errors are random noise, models extract the correct signal.
Why it matters
This is a mechanics question about why language models work at all, not a capability test. The finding suggests that what we call "truth bias" in large models may not be an intrinsic preference for accuracy but an artifact of how real-world data is structured: true statements are internally consistent, so they contradict each other less often and compress better. That reframes a core assumption about how language models learn. They are not truth-seeking engines; they are compression engines that happen to compress truth well when truth is abundant and errors are noisy. If this holds at scale, it changes how we should think about why models fail. On this view, a model that rejects a false claim does so not because it knows better but because the claim compresses poorly, and it will accept a false claim that fits a coherent, compressible pattern it has already learned.
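The compression argument can be made concrete with a toy description-length count. The sketch below is an illustration only, not the researchers' setup: the off-by-one `false_rule`, the 8-bit cost per exception (`ANSWER_BITS`), the 50/50 mix of right and wrong answers, and all function names are assumptions made for the example. It measures how many bits are left to spell out once a single rule is chosen to explain a corpus of addition problems.

```python
import random

# Assumed cost, in bits, of spelling out one answer the chosen rule fails to predict.
ANSWER_BITS = 8.0

def true_rule(a, b):
    return a + b

def false_rule(a, b):
    # Hypothetical coherent alternative system: every sum is off by one.
    return a + b + 1

def coherent_error(a, b, rng):
    # Wrong answers that all follow the same alternative rule.
    return false_rule(a, b)

def random_error(a, b, rng):
    # Wrong answers with no rule behind them (redrawn until actually wrong).
    wrong = rng.randint(0, 198)
    while wrong == a + b:
        wrong = rng.randint(0, 198)
    return wrong

def make_corpus(n, make_error, seed=0):
    # Half the examples carry the correct answer, half carry an error.
    rng = random.Random(seed)
    corpus = []
    for i in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = true_rule(a, b) if i % 2 == 0 else make_error(a, b, rng)
        corpus.append((a, b, answer))
    return corpus

def exception_bits(corpus, rule):
    # Bits needed for every example the rule does not predict.
    misses = sum(1 for a, b, answer in corpus if rule(a, b) != answer)
    return misses * ANSWER_BITS

coherent = make_corpus(1000, coherent_error)
noisy = make_corpus(1000, random_error)

for name, corpus in [("coherent errors", coherent), ("random errors", noisy)]:
    print(name, exception_bits(corpus, true_rule), exception_bits(corpus, false_rule))
```

With coherent errors, the true rule and the false rule leave the same number of exceptions, so a pure compressor has no reason to prefer either. With random errors, the true rule leaves fewer exceptions than the false rule, so the cheapest description is also the correct one.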
The signal
Watch whether larger models (100B+ parameters, trained on internet-scale data) show the same sharp crossover point, where adding a second coherent alternative rule restores truth preference, or whether scale itself breaks this pattern.