Large language models measurably mimic human reasoning about meaning, but often get the strength wrong
What happened
Researchers measured whether large language models capture not just the pattern of human social reasoning but also its magnitude. They found that the models reliably capture the pattern but not the magnitude. Prompting the models to think through speaker knowledge and motives helps, but no technique fully solves the calibration problem: the models get the *structure* of inference right but often amplify or diminish the strength of conclusions.
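The difference between matching a pattern and matching its magnitude can be shown with a toy calculation. The sketch below is only an illustration: the ratings are invented, and the metrics (Pearson correlation for pattern, regression slope and mean absolute error for magnitude) are assumptions chosen for clarity, not the measures used in the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation: how well the two series agree in pattern."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def slope(xs, ys):
    """Least-squares slope of ys on xs: 1.0 means equal strength,
    above 1.0 means ys amplifies xs, below 1.0 means ys dampens it."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / var_x


def mean_abs_error(xs, ys):
    """Average absolute gap between the two series."""
    return mean(abs(x - y) for x, y in zip(xs, ys))


# Hypothetical inference-strength ratings on a 0-1 scale (invented data).
human = [0.2, 0.4, 0.6, 0.8]
# Hypothetical model ratings: same ordering, exaggerated strength.
model = [0.1, 0.4, 0.7, 1.0]

pattern_match = pearson(human, model)    # close to 1.0: pattern reproduced
strength_ratio = slope(human, model)     # close to 1.5: effects amplified
gap = mean_abs_error(human, model)       # close to 0.1: nonzero calibration gap

print(pattern_match, strength_ratio, gap)
```

In this invented case the model ranks every item exactly as the humans do, so a pattern-only metric would report a perfect match, while the slope and the error expose the miscalibration.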
Why it matters
Until now, the field has assumed that if an AI system produces outputs that *look* like human reasoning, it understands social meaning the way humans do. This paper shows that is only half true: the models reliably reproduce the qualitative pattern of inference but distort its magnitude in unpredictable ways. That matters because it suggests that some of the ways LLMs appear to understand context or nuance are structural mimicry rather than comprehension. The models follow a shape they have learned without calibrating it to the right weight. Prompts grounded in pragmatic theory, which ask the model to reason about what the speaker knows and wants, help partially. That makes them a real lever for improvement, but an incomplete one.
The signal
Watch whether subsequent LLM training or prompting methods actually achieve the fine-grained magnitude calibration this paper couldn't solve, or whether the calibration gap persists as models scale.