AI models struggle to grade German short answers, even with custom prompts

What happened

Researchers tried several methods, including a new "meta-prompting" technique, to get AI models to score German student answers against rubrics. The AI-based systems performed poorly, consistently placing in the middle or lower half of participants in a shared task. The result suggests that prompted AI models are not yet reliable for automated grading of nuanced short answers in German, and possibly in other languages besides English.

Why it matters

Automated grading of student work is a long-standing goal for educators, promising to reduce teacher workload and give students faster feedback. This paper's results indicate that even with advanced prompting techniques, AI models still struggle with the subtleties of language and rubric interpretation, at least in this German-language setting. The path to fully automated, reliable grading may be longer than some assume, particularly for tasks requiring nuanced understanding.

The signal

Watch for future shared tasks or benchmarks in which AI models show significant improvements at grading short, rubric-based answers in non-English languages, particularly on tasks that require more than simple keyword matching.