The world is being quietly rearranged by people who write very long documents.


The title they went with Multimodal Large Language Models for Multi-Subject In-Context Image Generation Noisy translates that to

AI model learns to generate images with multiple people without losing track of who is who


Researchers built a multimodal AI system that can create images containing several different people from reference photos without mixing up their identities or losing subjects. In practice, this means you can now ask an AI to generate a scene with specific people in it, and the AI will actually remember which face is which across the entire image rather than degrading into a blurry composite.
Generating images with multiple subjects is a known hard problem for current AI systems. They tend to forget subjects, confuse identities, or degrade visually when you ask them to handle more than one or two reference images. This paper demonstrates a training approach that scales, which means the constraint of 'one or two faces max' stops being a technical ceiling. What this enables is less clear outside research settings. The benchmark is new and synthetic. Nobody deployed this at scale yet.
Watch whether MUSIC or similar multi-subject generation systems show up in commercial image generation products within the next 12 months, or remain confined to research demonstrations.

If you insist
Read the original →