The world is being quietly rearranged by people who write very long documents.


The title they went with: Brevity Constraints Reverse Performance Hierarchies in Language Models

Noisy translates that to:

Smaller language models outperform giant ones on 7.7% of benchmark problems, until the giants are told to answer briefly


Researchers found that the largest language models (like GPT-4) actually perform worse than much smaller models on roughly 1 in 13 benchmark problems (7.7%), because they tend to over-explain their answers in ways that introduce errors. Adding a simple instruction to keep responses short reverses these performance gaps entirely: the largest models then outperform small ones by 7 to 16 percentage points on math and science questions.
This suggests that AI evaluation practices are systematically masking the actual capabilities of larger models. The bigger models aren't fundamentally weaker on certain tasks; they're just being asked to respond in ways that trip them up. The practical implication is immediate: how you ask an AI to answer a question matters as much as which model you use. Companies deploying large language models may be wasting computational resources by using one-size-fits-all prompts instead of tailoring how they phrase requests. This complicates the common assumption that you simply need a bigger model to get better results.
The open question is whether language models deployed in production systems (customer service bots, search summaries, code generation tools) adopt brevity constraints, and whether doing so measurably reduces error rates compared to current baselines.
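For anyone who wants to check this on their own system, here is a minimal sketch of the comparison: ask the same questions with and without a brevity instruction and compare error rates. Everything named here is an assumption made for illustration. BREVITY_SUFFIX is hypothetical wording (this summary does not quote the paper's actual instruction), ask stands in for whatever function calls your model, and the substring check is a crude scoring rule, not the paper's method.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical wording; the paper's exact instruction is not quoted in this summary.
BREVITY_SUFFIX = "Answer in as few words as possible. Give only the final answer."


def with_brevity(question: str) -> str:
    """Append the brevity instruction to a question."""
    return f"{question}\n\n{BREVITY_SUFFIX}"


def error_rate(
    ask: Callable[[str], str],
    problems: Iterable[Tuple[str, str]],
    brief: bool,
) -> float:
    """Fraction of problems whose reply does not contain the expected answer.

    `ask` is a placeholder for any function that sends a prompt to a model
    and returns its text reply. Scoring is a case-insensitive substring
    match, which is a simplification.
    """
    total = 0
    wrong = 0
    for question, expected in problems:
        prompt = with_brevity(question) if brief else question
        reply = ask(prompt)
        total += 1
        if expected.strip().lower() not in reply.strip().lower():
            wrong += 1
    # An empty problem set is reported as zero errors rather than dividing by zero.
    return wrong / total if total else 0.0
```

Calling error_rate(ask, problems, brief=False) and error_rate(ask, problems, brief=True) on the same problem list gives a baseline and a constrained error rate to compare. A real evaluation would need a stricter answer checker, since a long rambling reply can contain the expected string by accident.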
