AI still can't reliably follow complex rules, even when the rules are translated into computer code

What happened

Researchers built DeonticBench, a benchmark of 6,232 tasks in which an AI model has to reason about rules drawn from taxes, airline policies, immigration law, and housing regulations. The best available models solved fewer than half of the tasks. Even when the models were given the option to translate the rules into formal computer code, they still failed to apply them reliably.
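To make "translating rules into formal computer code" concrete, here is a minimal Python sketch. The rule, the income limits, and every name in it are invented for illustration. None of it is taken from the benchmark, and the formal language the researchers used may be quite different.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    """Hypothetical case facts; the fields are illustrative assumptions."""
    annual_income: int
    household_size: int
    is_resident: bool


# Hypothetical income limits by household size (made-up numbers).
INCOME_LIMITS = {1: 30_000, 2: 40_000, 3: 50_000}


def is_eligible(a: Applicant) -> bool:
    """Invented rule: resident AND income at or below the limit for the household size."""
    # Household sizes missing from the table fall back to the largest limit (an assumption).
    limit = INCOME_LIMITS.get(a.household_size, max(INCOME_LIMITS.values()))
    return a.is_resident and a.annual_income <= limit


print(is_eligible(Applicant(35_000, 2, True)))   # True
print(is_eligible(Applicant(35_000, 1, True)))   # False
```

Once a rule is written this way, running it on a new case is mechanical. The step that can still go wrong is turning the written rule and the facts of a case into a correct encoding in the first place.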

Why it matters

This is a measurement of a real gap: large language models work well on tasks where the answer follows from what the model has seen in training, but they struggle when they have to actually apply rules to new situations. In legal, tax, and policy work, that gap matters because the whole point is applying known rules to novel cases. The benchmark itself is the main contribution: it is a publicly available tool that lets researchers measure whether the next generation of AI actually improves at rule-following. That measurement is essential before anyone deploys these systems to make decisions that affect people's taxes, immigration status, or housing eligibility.

The signal

Watch whether major AI labs publish improved DeonticBench scores over the next 18 months, or whether they avoid publishing scores at all. Silence would suggest the problem is harder than expected.