AI agents fail most finance tasks — even the best one completes only 38% of real spreadsheet workflows
What happened
Researchers built a benchmark of 172 real finance and accounting tasks drawn from actual enterprise spreadsheets, emails, and documents spanning 25 years: messy, multi-file work that mirrors what accountants and financial analysts actually do. When they tested the best AI agents available (including GPT 5.1, Claude, and Gemini), even the highest performer spent 16.8 minutes per task and succeeded only 38% of the time, exposing a wide gap between what AI can do in controlled settings and what enterprise work actually demands.
Why it matters
For years, the AI-for-work conversation has focused on what AI can theoretically do: write code, summarize documents, answer questions. But enterprises don't need isolated capabilities; they need agents that can navigate the actual texture of work, such as finding data across five files, fixing formatting errors, retrieving context from email threads from 2015, and running calculations that chain across spreadsheets. The benchmark shows that this gap is both large and specific. It suggests that the bottleneck for AI in finance isn't raw intelligence but coordination and persistence across a messy, multi-step enterprise landscape. A company betting its financial workflow automation on current AI agents is betting on an incomplete product.
The signal
Watch whether vendors start shipping tools trained on FinWorkBench results, and whether those tools push success rates above 60% on comparable tasks. That would signal the benchmark is shaping development rather than just measuring failure.