AI researchers find that vague real-world instructions break tool-selection systems built on precise academic examples

What happened

LLM systems that select tools from a library work well when instructions are detailed but fail when instructions are vague, which is how people actually phrase them. A team built a new benchmark to measure this gap, then developed a fix: an AI rewrites each vague instruction into a specific one before tool selection happens, which doubled performance on some systems.
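
To make the rewrite-then-retrieve idea concrete, here is a minimal sketch in Python. It is not the paper's method. The tool names and descriptions are invented, the retriever is a toy word-overlap scorer rather than a real embedding model, and rewrite_instruction is a canned stand-in for what would be an LLM call in a real system.

```python
import math
from collections import Counter

# Hypothetical tool library; names and descriptions are invented for illustration.
TOOLS = {
    "get_weather_forecast": "return the weather forecast temperature and rain for a city and date",
    "convert_currency": "convert an amount of money from one currency to another using exchange rates",
    "book_flight": "search and book airline flight tickets between two airports",
}


def tokenize(text):
    """Lowercase the text and split it into words, dropping basic punctuation."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(instruction):
    """Return (tool_name, score) for the best-matching tool description."""
    query = tokenize(instruction)
    scores = {name: cosine(query, tokenize(desc)) for name, desc in TOOLS.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def rewrite_instruction(vague):
    """Assumed stand-in for an LLM rewriter: a canned lookup, not a real model call."""
    canned = {
        "should i bring an umbrella tomorrow?":
            "Get the weather forecast and rain for my city tomorrow",
    }
    return canned.get(vague.strip().lower(), vague)


vague = "Should I bring an umbrella tomorrow?"
vague_tool, vague_score = retrieve(vague)
rewritten_tool, rewritten_score = retrieve(rewrite_instruction(vague))
print(vague_tool, round(vague_score, 3))
print(rewritten_tool, round(rewritten_score, 3))
```

In this toy setup the vague request shares almost no vocabulary with any tool description, so its best match is weak and unreliable, while the rewritten request lines up with the weather tool. A real system would swap in an embedding-based retriever and an actual model call for the rewriter.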

Why it matters

This is a lab-to-reality gap. Academic benchmarks for tool selection use hyper-specific instructions (exact API names, parameter values) that rarely appear in actual use. The paper measures what everyone building deployed LLM systems has probably discovered by accident: vague instructions wreck retrieval. The fix is simple enough that it will probably get absorbed into every tool-calling system that needs to work outside research settings. The question isn't whether this matters. It's how many production systems are already failing this way without anyone measuring it.

The signal

Track whether major LLM providers (OpenAI, Anthropic, Meta) adopt instruction-rewriting approaches in their tool-calling systems, or whether they start publishing retrieval performance against vague instructions as a standard benchmark alongside the existing academic metrics.