Language models know when they'll fail before they generate an answer

What happened

Researchers found that large language models encode information about their own success or failure in their internal state before they produce an answer. That makes it possible to build a fast predictor that asks, before generation starts, whether the model will get a given question right, and then to use that prediction to route hard questions to better models or more expensive processing. The reported payoff is a cut in computational cost of up to 70 percent while keeping accuracy the same.
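To make the idea concrete, here is a minimal sketch of a probe-and-route loop. It is an illustration under stated assumptions, not the researchers' method: the summary does not say which probe they used, so this uses a small logistic-regression probe, and the "activations" are synthetic vectors rather than real hidden states. The names `train_probe`, `predict_success`, `route`, `fake_activation` and the 0.5 threshold are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent.

    features: list of activation vectors captured before generation.
    labels:   1 if the model later answered correctly, else 0.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_success(probe, x):
    """Estimated probability that the model will answer correctly."""
    w, b = probe
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def route(probe, x, threshold=0.5):
    """Send likely successes down the cheap path, everything else to the expensive one."""
    return "cheap" if predict_success(probe, x) >= threshold else "expensive"


# ASSUMPTION: synthetic stand-in for hidden states. Real activations would be
# read out of the model; here "will succeed" examples simply cluster around +1
# and "will fail" examples around -1.
rng = random.Random(0)


def fake_activation(will_succeed):
    centre = 1.0 if will_succeed else -1.0
    return [rng.gauss(centre, 0.5) for _ in range(4)]


labels = [i % 2 for i in range(200)]
features = [fake_activation(y == 1) for y in labels]
probe = train_probe(features, labels)
```

In this toy setup, calling `route(probe, activation)` on a new vector returns `"cheap"` or `"expensive"`. In a real system the threshold would be tuned on held-out data to trade cost against accuracy.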

Why it matters

Right now, if you want a language model to think harder on a problem (extended reasoning, multiple attempts), you apply that to everything, which is expensive. This paper shows that the model already knows which problems will trip it up, and that this knowledge is buried in its activations before it starts writing. The practical effect is efficiency: you can spend expensive compute only on the questions the model will actually struggle with and skip it on the easy ones. This matters because it breaks the assumption that you cannot tell hard questions from easy ones until after the model has answered. The model and humans disagree about what's hard, and the model's own difficulty map is more useful for routing than asking humans.
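The efficiency argument comes down to simple arithmetic, sketched below. This is an illustrative cost model, not the paper's accounting: the function name `routed_cost_fraction`, its parameters, and any numbers you plug in are assumptions, and it takes for granted that every query pays a small probe cost and then goes down exactly one of two paths.

```python
def routed_cost_fraction(escalation_rate, cheap_cost, expensive_cost, probe_cost=0.0):
    """Average cost per query under routing, relative to always using the expensive path.

    escalation_rate: fraction of queries the probe sends to the expensive path.
    cheap_cost, expensive_cost, probe_cost: per-query costs in the same arbitrary unit.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    if expensive_cost <= 0:
        raise ValueError("expensive_cost must be positive")
    per_query = (
        probe_cost
        + escalation_rate * expensive_cost
        + (1.0 - escalation_rate) * cheap_cost
    )
    return per_query / expensive_cost


def savings(escalation_rate, cheap_cost, expensive_cost, probe_cost=0.0):
    """Fraction of compute saved compared with running the expensive path on everything."""
    return 1.0 - routed_cost_fraction(
        escalation_rate, cheap_cost, expensive_cost, probe_cost
    )
```

The savings grow as the probe escalates fewer queries and as the gap between the cheap and expensive paths widens. They only count if accuracy holds, which depends on how well the probe separates the two groups.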

The signal

Watch whether this prediction method actually reduces inference costs in production systems, or whether the probes degrade when applied to real-world queries that differ from the training tasks.