The world is being quietly rearranged by people who write very long documents.


The title they went with
IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge
Noisy translates that to

We can now measure how badly AI gets Islamic knowledge wrong


Researchers built a test of 10,000 Islamic knowledge questions and ran 26 AI models through it. Accuracy ranged from 40% to 94%, with models systematically failing on Quranic interpretation and showing strong bias toward particular Islamic schools of jurisprudence. The same AI tools that millions of people consult for religious guidance have no consistent way to know when they are making mistakes.
Until now, nobody had a standardized way to measure whether AI systems give accurate Islamic religious guidance. This benchmark exposes something uncomfortable: frontier models (like Gemini) perform well overall, but strong overall scores mask systematic failures. Some models are 99% accurate on Quranic questions, while others are only 32% accurate on the same material. Worse, models favor particular Islamic schools without disclosing it, meaning a user asking the same question could get a doctrinally different answer depending on which model they use. This matters because millions of Muslims already ask AI systems about religious practice, inheritance law, and ritual guidance, all areas where consistency and accuracy have real stakes.
Watch whether major AI labs publish their own Islamic knowledge scores and whether they dispute this benchmark's methodology. Pushback would signal that they know the test catches something they'd prefer not to publicize.

If you insist
Read the original →