What happened
Researchers created a standardized test suite of 193 real-world website development tasks to measure how well AI coding agents can turn visual designs into working code. The tasks range from simple static pages to complex multi-page sites with backend logic. This is the first systematic way to compare AI models on the end-to-end work of building websites, and it reveals that even the best models still fail frequently on the harder tasks.
Why it matters
Until now, there was no reliable way to measure whether AI coding agents were actually getting better at a real, economically significant task. This benchmark makes that progress measurable, and it shows that current AI still struggles with the hard parts of web development. That lets companies, researchers, and investors see where the technology falls short before betting on it to replace human developers.