Language models fail silently when users rephrase questions, and that changes which model is actually best

What happened

A new benchmark shows that small changes in how a question is phrased can make language models perform up to 12% worse on the same task, and the drop is not uniform across models. Comparisons based on standard tests can therefore mislead, because they do not measure how models handle the messy, real-world input most users actually produce.

Why it matters

For years, AI labs have ranked models on clean, controlled benchmarks that assume well-formed input, but real users make typos, rephrase questions, and make mistakes. When researchers added semantics-preserving perturbations (edits that change the wording but not the meaning) to standard tests, they found that prompt variation alone can account for half of a model's measured performance variance. The relative ranking of models also changed in 63% of cases after even a single perturbation. A model that looks best on today's benchmarks might perform worst in actual use, so procurement decisions based on benchmark scores rest on a foundation that does not reflect production reality.
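
To make the ranking claim concrete, here is a minimal Python sketch of how one might count ranking changes between a clean benchmark run and several perturbed runs. It is an illustration only, not the benchmark's actual method: the model names, the accuracy numbers, and the helper functions (ranking, ranking_changed, flip_rate) are all hypothetical.

```python
from typing import Dict, List

Scores = Dict[str, float]


def ranking(scores: Scores) -> List[str]:
    """Return model names ordered best-first (ties broken by name)."""
    return sorted(scores, key=lambda model: (-scores[model], model))


def ranking_changed(clean: Scores, perturbed: Scores) -> bool:
    """True if the best-first order differs between the two runs."""
    return ranking(clean) != ranking(perturbed)


def flip_rate(clean: Scores, perturbed_runs: List[Scores]) -> float:
    """Fraction of perturbed runs whose ranking differs from the clean run."""
    flips = sum(ranking_changed(clean, run) for run in perturbed_runs)
    return flips / len(perturbed_runs)


# Hypothetical accuracies, invented purely for illustration.
clean = {"model_a": 0.82, "model_b": 0.80, "model_c": 0.71}
perturbed_runs = [
    {"model_a": 0.74, "model_b": 0.78, "model_c": 0.70},  # a and b swap places
    {"model_a": 0.81, "model_b": 0.79, "model_c": 0.70},  # order unchanged
    {"model_a": 0.70, "model_b": 0.77, "model_c": 0.72},  # a drops to last
    {"model_a": 0.80, "model_b": 0.78, "model_c": 0.69},  # order unchanged
]

rate = flip_rate(clean, perturbed_runs)
print(f"Ranking changed in {rate:.0%} of perturbed runs")
```

With these made-up numbers, two of the four perturbed runs reorder the models. A real evaluation would replace the hard-coded dictionaries with scores measured on perturbed versions of the same test set.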

The signal

Watch whether major AI evaluation efforts (OpenAI's evals, Anthropic's internal testing, third-party benchmarking services) begin to include prompt-sensitivity testing as a standard part of model assessment.