The world is being quietly rearranged by people who write very long documents.


The title they went with: RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German Noisy translates that to:

AI models struggle to grade German short answers, even with custom prompts


Researchers tried several methods, including a new "meta-prompting" technique, to get AI models to score German student answers based on rubrics. The AI models performed poorly, consistently ranking in the middle or lower half of participants in a shared task. This means current AI tools are not yet reliable for automated grading of nuanced, short-form answers in languages other than English.
Automated grading of student work is a long-standing goal for educators, promising to reduce teacher workload and provide faster feedback. This paper shows that even with advanced prompting techniques, AI models still struggle with the subtleties of language and rubric interpretation, especially in non-English contexts. It suggests that the path to fully automated, reliable grading is longer than some might assume, particularly for tasks requiring nuanced understanding.
Watch for future shared tasks or benchmarks that show significant improvements in AI's ability to grade short, rubric-based answers in non-English languages, particularly for tasks requiring more than simple keyword matching.
