The world is being quietly rearranged by people who write very long documents.


The title they went with: "Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation". Noisy translates that to:

New AI benchmark reveals even the best language models fail about a third of expert-level tasks


Researchers built a test of 1,346 professional tasks across finance, law, medicine, and research, and found that even the most advanced AI systems succeed only about 66% of the time, with an average score around 55%. This is the first concrete measurement of how far current AI still is from doing expert work reliably.
For years, AI capability tests have measured performance on generic tasks or relied on self-graded assignments, where the AI is essentially grading itself, which makes the results unreliable. This benchmark uses real professional rubrics and expert-written problems, so it measures what actually matters: can an AI system do work that a lawyer, doctor, or researcher would accept? The finding is stark: current systems can't, not reliably. The gap between 'impressive demo' and 'ready to replace a professional' is still enormous.
Watch whether companies building AI for professional work start reporting results on this benchmark publicly, and whether those results improve over time, or whether they quietly use different, looser tests instead.
