Researchers measure how much LLM a reasoning agent actually needs: explicit planning beats language-model revision

What happened

A team built an agent that separates the work the LLM does from the structured reasoning around it, so that each component's contribution to performance can be measured on its own. The finding: explicit world-modeling and planning improve performance substantially, while LLM-based revision adds almost nothing measurable. That suggests most of the agent's competence comes from the scaffolding, not from the language model itself.

Why it matters

For two years, researchers have built increasingly complex agents that bundle planning, memory, and reflection inside a single LLM loop. The obvious question went unanswered: which parts work because of the language model, and which work because of the structure around it? This paper cuts that knot by externalizing agent state into an inspectable runtime, so the contribution of each component can be observed directly. The practical implication is stark: if explicit symbolic planning and structured reflection do most of the work, then the expensive part (running an LLM at every step) may not be doing what people think it is doing. That opens the possibility of agents that are cheaper, faster, or more transparent, built by stripping out LLM calls that add little and keeping the scaffolding.
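To make "externalized agent state" concrete, here is a minimal sketch in Python. It is not the paper's implementation: the names AgentState, update_world_model, plan_next, llm_revise, and run are all hypothetical, the planner is a toy rule, and the revision step is a stub that calls no model. It only shows the general shape: state lives in a plain data structure, each component records what it changed, and a component can be removed to see what is lost.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    """Agent state held outside the LLM so it can be inspected (hypothetical)."""
    world_model: dict[str, str] = field(default_factory=dict)
    plan: list[str] = field(default_factory=list)
    # Each entry is (component name, description of what it changed).
    trace: list[tuple[str, str]] = field(default_factory=list)


Component = Callable[[AgentState, str], None]


def update_world_model(state: AgentState, observation: str) -> None:
    """Record an observation of the form 'key=value' in the world model."""
    key, _, value = observation.partition("=")
    state.world_model[key] = value
    state.trace.append(("world_model", f"set {key}={value}"))


def plan_next(state: AgentState, observation: str) -> None:
    """Toy symbolic planner: plan to open everything known to be closed."""
    state.plan = [
        f"open {name}"
        for name, status in sorted(state.world_model.items())
        if status == "closed"
    ]
    state.trace.append(("planner", f"plan={state.plan}"))


def llm_revise(state: AgentState, observation: str) -> None:
    """Stub standing in for an LLM revision call (assumption: no model is
    called here; a real version might rewrite state.plan)."""
    state.trace.append(("llm_revision", "stub, no model called"))


def run(observations: list[str], components: list[Component]) -> AgentState:
    """Run every enabled component on every observation, in order."""
    state = AgentState()
    for observation in observations:
        for component in components:
            component(state, observation)
    return state


observations = ["door_a=closed", "door_b=open"]
full = run(observations, [update_world_model, plan_next, llm_revise])
no_planner = run(observations, [update_world_model, llm_revise])

print(full.plan)        # ['open door_a']
print(no_planner.plan)  # []
```

Because every component writes to the same trace, a run with one component removed can be compared line by line against the full run. That comparison is the kind of per-component measurement the brief describes.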

The signal

Watch whether the next wave of agent research externalizes reflection and planning into declarative structures rather than bundling everything into the LLM, or whether the field ignores the measurement and keeps building bigger black-box loops.