What happened
Researchers built a benchmark of 10,000 Islamic knowledge questions and ran 26 AI models through it. Accuracy ranged from 40% to 94%. Models failed systematically on Quranic interpretation and showed a strong bias toward particular Islamic schools of jurisprudence. The AI tools that millions of people consult for religious guidance have no consistent way to know when they are making mistakes.
Why it matters
Until now, nobody had a standardized way to measure whether AI systems give accurate Islamic religious guidance. This benchmark exposes something uncomfortable: frontier models (like Gemini) perform well overall, but strong overall scores mask systematic failures. Some models are 99% accurate on Quranic questions, while others are only 32% accurate on the same material. Worse, models show an undisclosed preference for particular Islamic schools, so a user asking the same question could get a doctrinally different answer depending on which model they use. This matters because millions of Muslims already ask AI systems about religious practice, inheritance law, and ritual guidance, areas where consistency and accuracy have real stakes.