AI image generators fail when asked to draw more than a few people together

What happened

Current text-to-image models can reliably render recognizable versions of 2-4 specific people in one image, but they fail catastrophically, producing blurry clones, when asked for 6-10 people or when those people interact physically. The failure exposes a fundamental architectural limit: these models do not actually know how to keep different identities separate at scale; they merely get lucky on simple cases.

Why it matters

This is a stress test showing that current generative AI has hit a hard scaling wall on a seemingly simple task: maintaining distinct identities in a single image. The paper also shows that the standard metrics used to evaluate these models are misleading, because they score identity-collapsed images as correct. As a result, the industry has been optimistic about progress it has not actually made. It is an honesty check on what these systems can and cannot do.
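
To make the point about misleading metrics concrete, here is a minimal toy sketch of how a metric could score an identity-collapsed image as correct. The summary does not say which metrics the paper examined, so everything here is an assumption for illustration: the function names naive_score and one_to_one_score are hypothetical, the one-hot vectors stand in for real face embeddings, and the "best match per face" rule is just one plausible way such a metric might be built.

```python
from itertools import permutations

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def naive_score(generated, references):
    """Hypothetical metric: each generated face is scored against its best-matching
    reference. Nothing stops every face from matching the same reference."""
    return sum(max(cosine(g, r) for r in references) for g in generated) / len(generated)

def one_to_one_score(generated, references):
    """Hypothetical stricter metric: each reference identity must be matched to a
    different generated face. Brute force over assignments, fine for toy sizes."""
    n = len(references)
    best_total = max(
        sum(cosine(generated[assignment[i]], references[i]) for i in range(n))
        for assignment in permutations(range(len(generated)), n)
    )
    return best_total / n

# Toy stand-ins for face embeddings: six identities, one axis each (an assumption).
references = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(6)]
distinct = [list(r) for r in references]                  # six different people
collapsed = [list(references[0]) for _ in range(6)]       # six clones of person 0

print("naive, distinct:      ", naive_score(distinct, references))
print("naive, collapsed:     ", naive_score(collapsed, references))
print("one-to-one, distinct: ", one_to_one_score(distinct, references))
print("one-to-one, collapsed:", one_to_one_score(collapsed, references))
```

In this toy setup the naive metric gives the six-clone image the same score as the image with six distinct people, while the one-to-one version credits only the single identity that actually appears. Real evaluation pipelines are more involved, so this shows only the kind of blind spot being described, not the paper's own analysis.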