New AI benchmark reveals even the best language models fail about a third of expert-level tasks

What happened

Researchers built a test of 1,346 professional tasks across finance, law, medicine, and research, and found that the most advanced AI systems succeed only about 66% of the time, with an average score around 55%. The result puts a concrete number on how far current AI still is from doing expert work reliably.

Why it matters

For years, AI capability tests have relied on generic tasks or on self-graded assignments, where the AI is essentially grading itself, which makes the results unreliable. This benchmark uses real professional rubrics and expert-written problems, so it measures what actually matters: can an AI system do work that a lawyer, doctor, or researcher would accept? The finding is stark: not reliably. The gap between 'impressive demo' and 'ready to replace a professional' is still wide.

The signal

Watch whether companies building AI for professional work start using this benchmark publicly, and whether published results show improvement over time — or whether they quietly use different, looser tests instead.