Smaller language models outperform giant ones on 7.7% of benchmark problems, until the giants are told to answer briefly

What happened

Researchers found that the largest language models (like GPT-4) perform worse than much smaller models on roughly 1 in 13 benchmark problems (about 7.7%) because they tend to over-explain, and the extra text introduces errors. Adding a simple instruction to keep responses short reverses these gaps entirely: under the brevity constraint, the largest models outperform small ones by 7 to 16 percentage points on math and science questions.
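
As a rough illustration of what such an instruction might look like in practice, the sketch below appends a brevity constraint to a question before it is sent to a model. The constraint wording, the 50-word default, and the function name with_brevity_constraint are assumptions made for illustration; the researchers' actual phrasing may differ.

```python
# Hypothetical helper: the constraint wording and the default word limit
# are illustrative assumptions, not taken from the research itself.
def with_brevity_constraint(question: str, max_words: int = 50) -> str:
    """Return the question with an instruction to answer briefly appended."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    constraint = f"Answer in at most {max_words} words."
    # Keep the question and the constraint visually separate in the prompt.
    return f"{question.strip()}\n\n{constraint}"


if __name__ == "__main__":
    print(with_brevity_constraint("What is 17 * 23?", max_words=20))
```

The resulting string can be passed to whatever model client is already in use; nothing here depends on a particular provider.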

Why it matters

This suggests that common AI evaluation practices systematically understate what larger models can do. The bigger models aren't fundamentally weaker on these problems; they are being prompted in ways that trip them up. The practical implication is immediate: how you ask an AI to answer a question can matter as much as which model you use. Companies deploying large language models may be wasting computational resources by using one-size-fits-all prompts instead of tailoring how they phrase requests. It also complicates the common assumption that getting better results is simply a matter of using a bigger model.

The signal

Watch whether language models deployed in production systems (customer service bots, search summaries, code generation tools) adopt brevity constraints, and whether doing so measurably reduces error rates relative to current baselines.
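
One way a team might check this on its own traffic is to grade the same set of questions with and without the constraint and compare error rates. The sketch below assumes each answer has already been graded as correct or incorrect; the function names and the sample lists are made-up placeholders, not results from the research.

```python
# Hypothetical evaluation helpers. The graded lists at the bottom are
# placeholder values for demonstration only, not real measurements.
def error_rate(graded: list[bool]) -> float:
    """Fraction of answers graded incorrect (True = correct, False = wrong)."""
    if not graded:
        raise ValueError("graded must not be empty")
    return graded.count(False) / len(graded)


def error_rate_change(baseline: list[bool], constrained: list[bool]) -> float:
    """Change in error rate, in percentage points, after adding the constraint.

    A negative value means fewer errors with the brevity constraint.
    """
    return 100 * (error_rate(constrained) - error_rate(baseline))


if __name__ == "__main__":
    baseline = [True, False, True, False]    # placeholder grades
    constrained = [True, True, True, False]  # placeholder grades
    print(error_rate_change(baseline, constrained))
```

A real comparison would need far more than a handful of questions, and ideally the same questions in both conditions, before any difference could be trusted.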