The world is being quietly rearranged by people who write very long documents.


The title they went with:
WHBench: Evaluating Frontier LLMs with Expert-in-the-Loop Validation on Women's Health Topics

Noisy translates that to:

AI models fail basic women's health questions, and a new test shows how


Researchers built a new test for AI models that give medical advice on women's health. It turns out that even the best models fail more than a quarter of the time and make unsafe errors.
For years, it was hard to tell exactly how well AI models handled sensitive medical topics like women's health. This new test gives developers and regulators a specific way to find out where models make mistakes, including unsafe omissions and dosing errors. It means AI tools for health can now be held to a higher, more specific standard.
Watch whether AI developers start using this benchmark to improve their models, or whether health regulators cite its findings in new guidance.
