Evaluating LLM Cognitive Depth in Educational Question Generation

Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen · June 18, 2026

The 60-second brief

Summary

This work evaluates six large language models' ability to generate educational questions that stimulate higher-order thinking, moving beyond rote memorization. Using a hybrid human-AI protocol and Bloom's Taxonomy, researchers found that fine-grained prompting strategies significantly reduce repetitiveness and increase the proportion of higher-order cognitive outputs, with InternLM3 showing superior performance in multi-level transitions.

While large language models (LLMs) show considerable potential for automating educational content creation, their capacity to generate questions that encourage higher-order thinking, rather than simple recall, remains underexplored. This research addresses that gap by evaluating six prominent LLMs through the lens of Bloom's Taxonomy. A hybrid human-AI evaluation protocol was used to analyze over 20,000 questions across computer science, K-12 math, and social science. The authors developed a fine-grained prompting strategy that reduced question repetitiveness by 24.45% for Qwen2.5-7B-Instruct and increased higher-order cognitive-level outputs by 11.53% for InternLM3-8B-Instruct. The study also introduced quantitative metrics, including Cognitive Shift Intensity (CogShift) and category drift, which showed that InternLM3 performed best at multi-level cognitive transitions. An interpretability analysis further identified correlations that make Chain-of-Thought prompting more transparent. These results underscore the importance of cognitive-aware prompt design when deploying LLMs in personalized learning systems.
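To make the idea of a "proportion of higher-order outputs" concrete, here is a minimal Python sketch. It is not the paper's metric: this summary does not define CogShift or category drift, so the code only computes a simple share of questions labeled at the upper Bloom levels. The function names `higher_order_share` and `share_change` are hypothetical, and treating Analyze, Evaluate, and Create as "higher-order" is a common convention assumed here, not a detail confirmed by the source.

```python
from collections import Counter

# Revised Bloom's Taxonomy, ordered from lowest to highest level.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

# Assumption: the top three levels count as "higher-order".
# The paper may draw this boundary differently.
HIGHER_ORDER = {"analyze", "evaluate", "create"}


def higher_order_share(labels):
    """Return the fraction of Bloom labels that are higher-order."""
    if not labels:
        return 0.0
    normalized = [label.strip().lower() for label in labels]
    unknown = set(normalized) - set(BLOOM_LEVELS)
    if unknown:
        raise ValueError(f"unknown Bloom levels: {sorted(unknown)}")
    counts = Counter(normalized)
    return sum(counts[level] for level in HIGHER_ORDER) / len(normalized)


def share_change(baseline_labels, prompted_labels):
    """Percentage-point change in higher-order share between two runs."""
    before = higher_order_share(baseline_labels)
    after = higher_order_share(prompted_labels)
    return 100.0 * (after - before)
```

Calling `share_change` on the labels from a baseline prompt and from a cognitive-aware prompt gives a rough before/after comparison. The labels themselves would have to come from human raters or a classifier, which this sketch does not cover.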

Why it matters

This research is critical for educators and EdTech developers aiming to leverage AI for creating more effective and engaging learning materials. It provides insights into how to prompt LLMs to generate questions that truly challenge students and foster deeper understanding, moving beyond superficial knowledge checks.

How to implement this in your domain

  1. Apply fine-grained, cognitive-aware prompting strategies when using LLMs to generate educational content.
  2. Integrate Bloom's Taxonomy principles into AI-driven question generation systems for higher-order thinking.
  3. Evaluate LLM-generated questions using metrics like CogShift to assess cognitive depth and variety.
  4. Customize personalized learning systems with LLMs capable of producing diverse and challenging question types.
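As a rough illustration of the first two steps, the sketch below builds a prompt that names a target Bloom level and suggests matching action verbs. The prompt wording, the verb lists, and the `build_prompt` helper are all assumptions made for illustration; the summary does not reproduce the authors' actual prompts.

```python
# Illustrative action verbs per Bloom level (assumed, not taken from the paper).
BLOOM_VERBS = {
    "remember": ["define", "list", "recall"],
    "understand": ["explain", "summarize", "classify"],
    "apply": ["solve", "use", "demonstrate"],
    "analyze": ["compare", "differentiate", "examine"],
    "evaluate": ["justify", "critique", "assess"],
    "create": ["design", "propose", "construct"],
}


def build_prompt(topic, level, n_questions=3):
    """Build a question-generation prompt targeting one Bloom level."""
    level = level.strip().lower()
    if level not in BLOOM_VERBS:
        raise ValueError(f"unknown Bloom level: {level!r}")
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    verbs = ", ".join(BLOOM_VERBS[level])
    return (
        f"Write {n_questions} distinct questions about {topic}.\n"
        f"Target the '{level.capitalize()}' level of Bloom's Taxonomy.\n"
        f"Use action verbs such as: {verbs}.\n"
        "Do not repeat the structure of an earlier question."
    )


if __name__ == "__main__":
    print(build_prompt("binary search trees", "Analyze"))
```

The returned string can be sent to any LLM. Keeping the level and verbs explicit makes it easy to sweep across all six levels and check whether the generated questions actually land where requested.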

Who benefits

EdTech, Education, Corporate Training, AI Engineering, Content Creation

Key takeaways

  • LLMs can generate higher-order thinking questions with cognitive-aware prompting.
  • Fine-grained prompting reduces repetitiveness and increases cognitive depth in outputs.
  • InternLM3 showed superior performance in multi-level cognitive transitions.
  • Bloom's Taxonomy is a valuable lens for evaluating LLM-generated educational content.

Original post by Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen

"arXiv:2606.18257v1 Announce Type: cross Abstract: While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom's Taxonomy…"

