AI can now generate coding homework at scale — but humans still have to design the trick answers

What happened

Researchers built an AI system that writes multiple-choice coding questions for students, then had six experts rate 288 AI-generated questions across seven teaching criteria. The system succeeded at generating clear questions with valid code (80–99% success rates), but failed at the harder pedagogical work: designing wrong answers that teach something, and writing explanations that actually deepen understanding.

Why it matters

Educational content generation is labor-intensive and repetitive, exactly the kind of work AI looks built for. This paper shows where that promise runs out. The system handles the mechanical parts well: checking that code runs, matching learning objectives, stating questions clearly. But it fails at the craft: working out which mistake a student needs to make in order to learn something, and why a given distractor matters. That suggests AI can automate perhaps 40–50% of homework writing, but not the parts that require teaching skill. Schools will have to decide whether the teacher time saved on drafting outweighs the time spent repairing AI-generated scaffolding, or whether the tools just create more busywork to check.

The signal

Watch whether university programming courses start using CODE-GEN or similar systems in the next two years, and whether students taught with AI-generated questions perform differently on exams than those taught with human-written ones.