The world is being quietly rearranged by people who write very long documents.


The title they went with
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Noisy translates that to

First benchmark measures how well AI builds websites from screenshots


Researchers created a standardized test suite with 193 real-world website development tasks to measure how well AI coding agents can turn visual designs into working code — from simple static pages to complex multi-page sites with backend logic. This is the first systematic way to compare different AI models on the actual end-to-end work of building websites, revealing that even the best models still fail frequently on harder tasks.
Until now, there was no reliable way to measure whether AI coding agents were actually getting better at a real, economically significant task. This benchmark makes that measurable — it shows current AI struggles with the hard parts of web development, which matters because it lets companies, researchers, and investors see exactly where the technology still falls short before betting on it to replace human developers.
