First real-world benchmark for AI agents remembering users across years and domains

What happened

Researchers created the first large-scale test of whether AI language models can remember and recognize individual users over long periods and across different topics. It is built on real Amazon shopping behavior rather than synthetic, scripted conversations. Most current AI memory systems perform poorly on it: they cannot reliably track what a real person cares about over months or years. That matters because companies want AI assistants that feel personalized and remember their users.

Why it matters

Until now, AI memory systems have been tested only on short, artificial conversations that do not reflect how real people interact with AI over time. This benchmark shows that current memory methods fall short of the personalization companies are betting on. Either AI assistants will need a fundamentally different memory architecture, or personalization claims are running ahead of what the technology can do.