Curriculum-Grounded LLM-as-Judge Pipeline Enhances Automated Exam Marking

Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu · June 17, 2026

Summary

This paper introduces a curriculum-grounded LLM-as-Judge pipeline for automated question-level marking, co-developed with an industrial partner for university admission exam preparation. The pipeline systematically grounds LLM outputs in official curriculum artifacts and marking guidelines, generating question-specific rubrics and evaluating student responses with improved consistency and transparency.

Generative AI and large language models (LLMs) are increasingly applied in education, particularly for automated assessment. For high-stakes exams, however, prompt engineering alone is insufficient: a software pipeline is needed to keep LLM outputs systematically aligned with authorized curriculum materials and official marking guidelines.

The paper presents an "LLM-as-Judge" pipeline designed for question-level marking in university admission exam preparation. The configurable system identifies a question's relevant topics, subtopics, and cognitive demands, then compiles verifiable context from syllabus artifacts, including prescribed verbs, outcomes, performance descriptors, and glossary definitions. This grounds the LLM's judgment in authoritative curriculum sources.

The pipeline uses a staged LLM workflow. It first generates question-specific rubrics that set out structured performance expectations, then derives marking criteria and applies them to allocate marks to student responses. According to the authors, this design improves the consistency and transparency of automated marking and keeps it aligned with official practice.

Preliminary evaluations indicate that the pipeline produces marking outcomes comparable to those of human tutors, with justifications that are more traceable to official curriculum and marking standards. The pipeline has been integrated into an online study platform, which is providing initial operational insights.
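The staged workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Artifact` and `MarkingResult` types, the prompt wording, the `marks|justification` reply format, and the `fake_llm` stand-in are all hypothetical, and a real system would call an actual model and parse its output more defensively.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Artifact:
    """One piece of syllabus material (all field names are illustrative)."""
    kind: str   # e.g. "outcome", "verb", "descriptor", "glossary"
    ref: str    # identifier used to trace a justification back to its source
    topic: str
    text: str


@dataclass
class MarkingResult:
    marks: int
    max_marks: int
    justification: str
    cited_refs: List[str]


def compile_context(topic: str, artifacts: List[Artifact]) -> List[Artifact]:
    """Select the syllabus artifacts that belong to the question's topic."""
    return [a for a in artifacts if a.topic == topic]


def format_context(context: List[Artifact]) -> str:
    """Render artifacts with their reference ids so the model can cite them."""
    return "\n".join(f"[{a.ref}] ({a.kind}) {a.text}" for a in context)


def generate_rubric(question: str, context: List[Artifact], llm: LLM) -> str:
    """Stage 1: ask the model for a question-specific rubric."""
    prompt = (
        "TASK: write a rubric\n"
        f"QUESTION: {question}\n"
        f"CURRICULUM CONTEXT:\n{format_context(context)}"
    )
    return llm(prompt)


def mark_response(question: str, response: str, rubric: str,
                  context: List[Artifact], max_marks: int, llm: LLM) -> MarkingResult:
    """Stage 2: apply the rubric to a student response and allocate marks."""
    prompt = (
        "TASK: mark the response\n"
        f"QUESTION: {question}\nRUBRIC: {rubric}\nRESPONSE: {response}\n"
        f"CURRICULUM CONTEXT:\n{format_context(context)}\n"
        f"Reply as '<marks out of {max_marks}>|<justification citing [refs]>'"
    )
    raw = llm(prompt)
    # Assumed reply format: "<integer>|<justification>".
    marks_text, _, justification = raw.partition("|")
    marks = max(0, min(max_marks, int(marks_text.strip())))  # clamp to valid range
    # Traceability: record which supplied artifacts the justification cites.
    cited = [a.ref for a in context if f"[{a.ref}]" in justification]
    return MarkingResult(marks, max_marks, justification.strip(), cited)


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only for this demo."""
    if prompt.startswith("TASK: write a rubric"):
        return "Full marks: relates cause and effect in the food web [OUT-1]."
    return "3|Explains the cause but only partly the effect [OUT-1]."


# Made-up syllabus artifacts for illustration.
artifacts = [
    Artifact("outcome", "OUT-1", "ecosystems", "Explains relationships in a food web."),
    Artifact("verb", "VERB-EXPLAIN", "ecosystems", "Explain: relate cause and effect."),
    Artifact("outcome", "OUT-9", "genetics", "Describes inheritance patterns."),
]
question = "Explain how removing a predator affects a food web."
context = compile_context("ecosystems", artifacts)
rubric = generate_rubric(question, context, fake_llm)
result = mark_response(question, "Prey numbers rise.", rubric, context, 4, fake_llm)
print(result.marks, result.max_marks, result.cited_refs)
```

Keeping rubric generation and marking as separate calls means the rubric can be inspected or reviewed before any marks are allocated, and recording `cited_refs` gives each justification an explicit link back to the supplied curriculum material.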

Why it matters

For educational institutions, EdTech companies, and professionals involved in assessment, this pipeline offers a robust, transparent, and consistent method for automated marking, potentially reducing workload, improving feedback quality, and ensuring alignment with curriculum standards.

How to implement this in your domain

1. Integrate curriculum artifacts and official marking guidelines directly into AI assessment pipelines.
2. Develop staged LLM workflows to first generate rubrics and then apply marking criteria for student responses.
3. Prioritize transparency and traceability in AI-generated feedback by linking it to authorized educational content.
4. Pilot LLM-as-Judge systems in low-stakes environments to refine accuracy and gain user trust.
5. Collaborate with educational experts to ensure AI assessment tools meet pedagogical and fairness standards.
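For the piloting step, one simple way to track accuracy is to compare LLM-assigned marks with marks given by human tutors on the same responses. The sketch below is an assumption about how such a check might look; the source does not say which agreement measures the authors used, and the function name, the tolerance parameter, and the sample marks are all made up for illustration.

```python
from typing import Dict, List


def agreement_report(llm_marks: List[int], tutor_marks: List[int],
                     tolerance: int = 1) -> Dict[str, float]:
    """Summarize how closely LLM marks track tutor marks on the same responses."""
    if not llm_marks or len(llm_marks) != len(tutor_marks):
        raise ValueError("need two non-empty mark lists of equal length")
    diffs = [abs(a - b) for a, b in zip(llm_marks, tutor_marks)]
    n = len(diffs)
    return {
        # Share of responses where both markers gave the same mark.
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        # Share of responses where the marks differ by at most `tolerance`.
        "within_tolerance": sum(d <= tolerance for d in diffs) / n,
        # Average size of the disagreement, in marks.
        "mean_abs_diff": sum(diffs) / n,
    }


# Made-up pilot data: four responses marked by the pipeline and by a tutor.
report = agreement_report([3, 5, 2, 4], [3, 4, 2, 1])
print(report)
```

Responses that fall outside the tolerance are natural candidates for review with educational experts before the system is used in higher-stakes settings.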

Who benefits

EdTech, Higher Education, K-12 Education, Professional Training, Assessment & Certification

Key takeaways

LLM-as-Judge systems can provide consistent and transparent automated marking for high-stakes exam preparation.
Grounding LLM outputs in official curriculum and marking guidelines is crucial for educational applications.
A staged workflow that generates rubrics before marking improves alignment with official marking practices.
In preliminary evaluations, the pipeline produced marking outcomes comparable to those of human tutors, with traceable justifications.

Original post by Xiwei Xu, Chen Wang, Jacky Jiang, Phil Yang, Qian Fu, Mohan Dhall, Wenjie Zhang, Liming Zhu

"arXiv:2606.17507v1. Abstract: Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands…"

