The world is being quietly rearranged by people who write very long documents.


The title they went with:
Social Meaning in Large Language Models: Structure, Magnitude, and Pragmatic Prompting

Noisy translates that to:

Large language models now quantifiably mimic human reasoning about meaning — but often get the strength wrong


Researchers measured whether large language models capture not just the pattern of human social reasoning but also its magnitude, and found that they do the first but not reliably the second. Prompting the models to think through speaker knowledge and motives helps, but no technique fully solves the calibration problem: the models get the *structure* of inference right but often amplify or diminish the strength of conclusions.
Until now, the field has assumed that if an AI system produces outputs that *look* like human reasoning, it understands social meaning the way humans do. This paper shows that's only half true: the models reliably reproduce the qualitative pattern of inference but distort its magnitude in unpredictable ways. That matters because it suggests that some of what looks like LLMs understanding context or nuance is structural mimicry, not comprehension. The models follow a shape they've learned without calibrating it to the right weight. Prompts grounded in pragmatic theory help partially, so there is a real lever for improvement, but it is an incomplete one.
Watch whether subsequent LLM training or prompting methods actually achieve the fine-grained magnitude calibration this paper couldn't solve, or whether the calibration gap persists as models scale.
