LLMs Outperform Supervised Models in Cross-Dataset Bloom Question Classification.

Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók · June 15, 2026

Summary

A study evaluated the cross-dataset generalization of machine learning models and the effectiveness of prompted LLMs for Bloom's taxonomy classification of assessment questions. LLMs, especially with in-context examples and course-specific action verbs, proved more stable and robust across diverse educational contexts than traditional supervised models.

Automatically classifying assessment questions according to Bloom's taxonomy can substantially reduce instructor workload, but labeling is subjective and varies between teachers. Previous machine learning (ML) and deep learning (DL) methods reported strong results within individual datasets, yet it remained unclear whether they generalize to other datasets. The effectiveness of large language models (LLMs) for this task had also not been systematically studied. This research evaluated both existing ML/DL methods and LLMs with various prompting strategies on five datasets to assess cross-dataset generalization. Traditional supervised ML/DL models degraded substantially when applied to datasets they had not been trained on. In contrast, LLMs were more stable and robust across diverse educational contexts. The most effective prompting strategy combined in-context examples with course-specific action verbs. Based on these insights, the authors developed a lightweight user interface that helps instructors automatically classify large question banks; a usability study found the tool highly usable and confirmed that it reduces workload.
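To make the winning prompting strategy concrete, here is a minimal sketch of a prompt that combines in-context examples with course-specific action verbs. It is an illustration, not the authors' prompt: the function names `build_prompt` and `parse_level`, the prompt wording, and the use of the six revised Bloom levels are all assumptions, and the call to an actual LLM is left out.

```python
# Assumption: the six levels of the revised Bloom's taxonomy. The source
# does not state which label set the paper uses.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]


def build_prompt(question, examples, action_verbs):
    """Build a few-shot classification prompt (hypothetical wording).

    question:     the assessment question to classify
    examples:     list of (question_text, level) pairs used as in-context examples
    action_verbs: dict mapping a level to the verbs the course uses for it
    """
    lines = [
        "Classify the assessment question into exactly one Bloom's taxonomy level.",
        "Levels: " + ", ".join(BLOOM_LEVELS),
        "",
        "Course-specific action verbs per level:",
    ]
    for level in BLOOM_LEVELS:
        verbs = action_verbs.get(level, [])
        if verbs:
            lines.append(f"- {level}: {', '.join(verbs)}")
    lines += ["", "Examples:"]
    for text, level in examples:
        lines += [f"Question: {text}", f"Level: {level}", ""]
    lines += [f"Question: {question}", "Level:"]
    return "\n".join(lines)


def parse_level(response):
    """Return the Bloom level mentioned earliest in the reply, or None."""
    lowered = response.lower()
    hits = [
        (lowered.find(level.lower()), level)
        for level in BLOOM_LEVELS
        if level.lower() in lowered
    ]
    return min(hits)[1] if hits else None


if __name__ == "__main__":
    prompt = build_prompt(
        "Design an experiment to measure the rate of osmosis.",
        examples=[("Define osmosis.", "Remember")],
        action_verbs={"Remember": ["define", "list"], "Create": ["design", "compose"]},
    )
    print(prompt)
```

The prompt ends with an open `Level:` slot so the model's reply can be kept short, and `parse_level` maps free-text replies back onto the fixed label set so that off-format answers surface as `None` rather than as a wrong label.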

Why it matters

Educators and EdTech professionals can leverage LLMs to automate and standardize the classification of assessment questions, significantly reducing manual workload and improving consistency. This enables more efficient curriculum development and better alignment of assessments with learning objectives.

How to implement this in your domain

1. Integrate LLM-based classification tools into educational platforms for automated question tagging.
2. Develop prompting strategies for LLMs that include in-context examples and domain-specific vocabulary for improved accuracy.
3. Train instructors on using LLM-powered tools for Bloom's taxonomy classification to streamline assessment creation.
4. Evaluate the consistency and accuracy of LLM classifications against human experts in specific educational contexts.
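The final step, checking LLM labels against human experts, can be sketched with two standard measures: plain accuracy and Cohen's kappa, which corrects agreement for chance. The source does not say which metrics the study used, so treat the choice of kappa, the function names, and the sample labels below as assumptions for illustration.

```python
from collections import Counter


def accuracy(human, model):
    """Fraction of questions where the model label matches the human label."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human, model):
    """Chance-corrected agreement between two label sequences."""
    observed = accuracy(human, model)
    n = len(human)
    human_counts = Counter(human)
    model_counts = Counter(model)
    # Agreement expected if both raters labeled at random with their own
    # label frequencies.
    expected = sum(
        human_counts[label] * model_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    human_labels = ["Remember", "Apply", "Apply", "Analyze"]
    model_labels = ["Remember", "Apply", "Analyze", "Analyze"]
    print(f"accuracy = {accuracy(human_labels, model_labels):.2f}")
    print(f"kappa    = {cohens_kappa(human_labels, model_labels):.2f}")
```

Because Bloom labeling is subjective, it can also help to compute the same kappa between two human experts on the same questions; that figure gives a realistic ceiling against which to read the human-versus-LLM score.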

Who benefits

EdTech, Education, Curriculum Development, Assessment Design, AI Development

Key takeaways

LLMs are more robust than supervised models for cross-dataset Bloom's taxonomy classification.
Effective prompting strategies combine in-context examples with course-specific action verbs.
LLM-based tools can significantly reduce instructor workload in classifying question banks.
This approach improves consistency and efficiency in educational assessment design.

Original post by Abdolali Faraji, Mohammadreza Molavi, Zohreh Rasoulkhani, Mohammadreza Tavakoli, Gábor Kismihók

"arXiv:2606.13684v1 Announce Type: cross Abstract: Automatic Bloom's taxonomy classification of assessment questions can substantially reduce instructor workload, but labeling is subjective and teacher-dependent. Prior machine learning (ML) and deep learning (DL) approaches report…"
