The world is being quietly rearranged by people who write very long documents.


The title they went with: Compositional Image Synthesis with Inference-Time Scaling
Noisy translates that to:

AI image generators get better at following complex instructions


Researchers developed a training-free add-on that makes text-to-image AI models much better at rendering what people actually ask for, such as correct object counts, positions, and attributes. Instead of retraining the model, it uses language models to create explicit layouts, generates multiple candidate images, and picks the best one, making the final output more faithful to the original request while keeping visual quality high.
Text-to-image models currently fail at basic compositional tasks that humans find trivial, such as drawing three dogs instead of one or placing objects in the right spatial arrangement. That limits their usefulness for design, product visualization, and instruction-following. This work shows a training-free path to fixing the problem without rebuilding the model entirely.
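
For the curious, the "generate several, keep the best" idea can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the paper's method: the layout is hard-coded rather than produced by a language model, `toy_generate` is a hypothetical stand-in for an image model, and `layout_score` is a made-up scorer that only counts objects. None of these names come from the original work.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

Layout = Dict[str, int]  # object label -> required count (assumed layout format)
Image = List[str]        # toy "image": the labels a detector would report


def toy_generate(layout: Layout, rng: random.Random) -> Image:
    """Hypothetical generator: renders each requested count off by -1, 0, or +1."""
    image: Image = []
    for label, count in layout.items():
        image.extend([label] * max(0, count + rng.choice([-1, 0, 1])))
    return image


def layout_score(layout: Layout, image: Image) -> float:
    """Negative total count error; 0.0 means every count matches the layout."""
    found = Counter(image)
    labels = set(layout) | set(found)
    return -float(sum(abs(layout.get(l, 0) - found.get(l, 0)) for l in labels))


def best_of_n(
    layout: Layout,
    n: int,
    seed: int = 0,
    generate: Callable[[Layout, random.Random], Image] = toy_generate,
    score: Callable[[Layout, Image], float] = layout_score,
) -> Tuple[Image, float]:
    """Generate n candidates and return the one that best matches the layout."""
    rng = random.Random(seed)
    candidates = [generate(layout, rng) for _ in range(n)]
    best = max(candidates, key=lambda img: score(layout, img))
    return best, score(layout, best)


if __name__ == "__main__":
    layout = {"dog": 3, "ball": 1}  # "three dogs and a ball"
    image, s = best_of_n(layout, n=16)
    print(image, s)
```

The point of the sketch is only that spending more compute at inference time (a larger `n`) can never lower the best score found for a given seed, since the selection runs over a superset of the earlier candidates. A real system would swap in an actual image model and a scorer that also checks positions and attributes.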

If you insist
Read the original →