The world is being quietly rearranged by people who write very long documents.


The title they went with: "Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows"

Noisy translates that to:

AI agents fail most finance tasks — even the best one completes only 38% of real spreadsheet workflows


Researchers built a benchmark of 172 real finance and accounting tasks drawn from actual enterprise spreadsheets, emails, and documents spanning 25 years: messy, multi-file work that mirrors what accountants and financial analysts actually do. When they tested the best AI agents available (including GPT-5.1, Claude, and Gemini), even the highest performer spent 16.8 minutes per task and succeeded less than 40% of the time, revealing a hard gap between what AI can do in controlled settings and what it needs to do in enterprise reality.
For years, the AI-for-work conversation has focused on what AI can theoretically do: write code, summarize documents, answer questions. But enterprises don't need isolated capabilities; they need agents that can navigate the actual texture of work: finding data across five files, fixing formatting errors, retrieving context from 2015 email threads, and chaining calculations across spreadsheets. This benchmark shows that the gap is both massive and specific. It tells you that the bottleneck for AI in finance isn't intelligence; it's coordination and persistence across a messy, multi-step enterprise landscape. A company betting its financial workflow automation on current AI agents is betting on an incomplete product.
Watch whether vendors start shipping tools trained on FinWorkBench results, and whether those tools push success rates above 60% on comparable tasks. That would signal the benchmark is actually shaping development rather than just measuring failure.
