Researchers built a synthetic patient dataset to test AI health agents, revealing which ones actually reason through medical timelines

What happened

A team created ESL-Bench, a benchmark of 100 synthetic patients whose health histories each span one to five years and include device data, clinical exams, and life events. Because the patients are synthetic, AI systems can be evaluated on medical reasoning without exposing real patient data. The headline finding: database-backed AI agents beat memory-augmented ones by roughly 50% on tasks that require multi-step reasoning and supporting evidence, a gap that had been hard to measure precisely before.

Why it matters

Until now, testing AI health agents against real patient data wasn't possible at scale, both because of privacy rules and because complex medical questions rarely come with definitive answers to grade against. A synthetic dataset removes that bottleneck: it becomes possible to measure whether an AI can trace causation through a patient's medical history or is just pattern-matching. The real surprise is the size of the gap. Database agents handle comparison and explanation queries much better than memory-augmented, retrieval-based ones, which suggests that the architecture of the AI system, not just the quality of its training, determines whether it can reason backwards from symptoms to cause.
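
To make the architectural point concrete, here is a minimal sketch in Python. It is not taken from ESL-Bench: the table layout, the sample readings, and the helper names `db_compare` and `retrieve_top_k` are all hypothetical, and the word-overlap retriever is a deliberately crude stand-in for a real memory-augmented agent. It contrasts a comparison query answered by a database aggregate with the same question answered by fetching the top-matching notes.

```python
import sqlite3

# Hypothetical timeline for one synthetic patient: (day, kind, value, note).
ROWS = [
    (1, "resting_hr", 62.0, "resting heart rate 62 bpm"),
    (2, "resting_hr", 64.0, "resting heart rate 64 bpm"),
    (3, "life_event", None, "started night-shift job"),
    (4, "resting_hr", 71.0, "resting heart rate 71 bpm"),
    (5, "resting_hr", 73.0, "resting heart rate 73 bpm"),
]


def build_db(rows):
    """Load the timeline into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (day INTEGER, kind TEXT, value REAL, note TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    return conn


def db_compare(conn, kind, event_note):
    """Database-style answer: mean of `kind` before vs. after a life event."""
    (event_day,) = conn.execute(
        "SELECT day FROM events WHERE kind = 'life_event' AND note = ?",
        (event_note,),
    ).fetchone()
    # AVG skips NULLs, so each CASE expression averages only its own side.
    return conn.execute(
        "SELECT AVG(CASE WHEN day < ? THEN value END), "
        "AVG(CASE WHEN day > ? THEN value END) "
        "FROM events WHERE kind = ?",
        (event_day, event_day, kind),
    ).fetchone()


def retrieve_top_k(rows, query, k=2):
    """Retrieval-style answer: top-k notes by naive word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(rows, key=lambda r: -len(words & set(r[3].lower().split())))
    return [r[3] for r in ranked[:k]]


conn = build_db(ROWS)
before, after = db_compare(conn, "resting_hr", "started night-shift job")
snippets = retrieve_top_k(ROWS, "resting heart rate after night-shift job")
print(before, after)
print(snippets)
```

In this toy case the database path returns both averages (63.0 before the life event, 72.0 after), while the retriever surfaces only the two earliest heart-rate notes and misses both the event and everything that followed it. Real retrieval systems are far more capable than word overlap, so treat this as an illustration of the failure mode rather than a measurement of it.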

The signal

Watch whether this benchmark gets adopted in published AI health agent papers over the next 12 months. Uptake would signal that researchers and companies actually care about testing longitudinal medical reasoning; silence would suggest they are still optimizing for tasks that don't match clinical work.