LLMs Evaluated on Scrum Certification Questions: Gemini 3 Flash Leads

Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio, Angelo Perkusich · July 2, 2026

Summary

This paper compares GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2 on 993 Scrum certification-style questions using various prompting strategies. Gemini 3 Flash achieved the highest accuracy, and the study analyzed performance across topics and question formats, identifying systematic error patterns.

Large Language Models (LLMs) are increasingly being used for exam preparation and knowledge assessment in specialized domains. This research evaluates the performance of three prominent LLMs—GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2—on a comprehensive set of 993 Scrum certification-style questions, specifically aligned with the Professional Scrum Master I (PSM I) assessment. The models were tested using three prompting strategies: zero-shot, chain-of-thought, and source-grounded, with repeated executions to gauge stability. Results indicated clear performance differences, with Gemini 3 Flash demonstrating the highest accuracy, followed by GPT-5 mini and DeepSeek Chat 3.2, while intra-model variability remained low. Analysis revealed that models performed best on single-answer multiple-choice questions and struggled more with multi-select and True/False formats. Performance was stronger in normatively explicit Scrum areas like Artifacts and Empiricism, but weaker in more interpretive topics such as Scrum Values and Self-Managing Teams. Qualitative analysis uncovered systematic error patterns, including overgeneralization and conflicts with strict Scrum definitions.
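The per-format accuracy and run-to-run stability described above can be illustrated with a small scoring sketch. The summary does not state the paper's scoring rule, so this example assumes exact-match scoring (a multi-select answer counts only if the selected set equals the expected set). The function names, the question records, and the two toy runs are hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def is_correct(predicted, expected):
    """Assumed exact-match rule: all expected options chosen, nothing extra."""
    return set(predicted) == set(expected)


def accuracy_by_format(questions, answers):
    """Return {format: accuracy} for a single run.

    questions: list of dicts with 'id', 'format', and 'expected' keys
    answers:   dict mapping question id -> list of chosen option labels
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in questions:
        totals[q["format"]] += 1
        if is_correct(answers.get(q["id"], ()), q["expected"]):
            hits[q["format"]] += 1
    return {fmt: hits[fmt] / totals[fmt] for fmt in totals}


def overall_accuracy(questions, answers):
    """Fraction of all questions answered correctly in a single run."""
    correct = sum(
        is_correct(answers.get(q["id"], ()), q["expected"]) for q in questions
    )
    return correct / len(questions)


def stability(run_accuracies):
    """Mean and population standard deviation across repeated runs."""
    return mean(run_accuracies), pstdev(run_accuracies)


# Toy data for illustration only (not taken from the paper).
questions = [
    {"id": 1, "format": "single", "expected": ["B"]},
    {"id": 2, "format": "multi", "expected": ["A", "C"]},
    {"id": 3, "format": "true_false", "expected": ["True"]},
    {"id": 4, "format": "single", "expected": ["D"]},
]
run_1 = {1: ["B"], 2: ["A"], 3: ["True"], 4: ["D"]}
run_2 = {1: ["B"], 2: ["A", "C"], 3: ["False"], 4: ["A"]}

per_format = accuracy_by_format(questions, run_1)
mean_acc, spread = stability(
    [overall_accuracy(questions, run) for run in (run_1, run_2)]
)
print(per_format, mean_acc, spread)
```

Under exact-match scoring, a partially correct multi-select answer (such as choosing only "A" when "A" and "C" are expected) scores zero, which is one plausible reason multi-select accuracy would trail single-answer accuracy.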

Why it matters

Professionals relying on LLMs for learning, coaching, or certification preparation in Agile methodologies need to understand their capabilities and limitations. This study provides critical insights into which models perform best and where their knowledge gaps lie regarding normative frameworks like Scrum.

How to implement this in your domain

  1. Use Gemini 3 Flash or GPT-5 mini for Scrum-related queries, especially for factual recall on artifacts and empiricism.
  2. Employ chain-of-thought or source-grounded prompting for improved accuracy when seeking Scrum advice from LLMs.
  3. Supplement LLM-generated answers with human expert review, particularly for multi-select questions or interpretive Scrum values.
  4. Develop internal guidelines for LLM usage in Agile training, highlighting areas where models are less reliable.
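As a rough illustration of the three prompting strategies named in the study (zero-shot, chain-of-thought, and source-grounded), the sketch below builds one prompt string per strategy. The instruction wording, the `build_prompt` helper, and the sample question are assumptions made for this example; the paper's actual prompts are not reproduced in this summary, and the reference text is left as a placeholder rather than a quoted passage.

```python
def format_question(stem, options):
    """Render the question stem followed by one labelled option per line."""
    lines = [stem] + [f"{label}. {text}" for label, text in options.items()]
    return "\n".join(lines)


def build_prompt(strategy, stem, options, source_excerpt=None):
    """Return a prompt string for one of three hypothetical strategy templates."""
    question = format_question(stem, options)
    if strategy == "zero_shot":
        # Ask for the answer directly, with no reasoning or context.
        return f"{question}\n\nAnswer with the letter(s) of the correct option(s) only."
    if strategy == "chain_of_thought":
        # Ask for explicit reasoning before a clearly marked final answer.
        return (
            f"{question}\n\nReason step by step, then give the final answer "
            "on a new line as 'Answer: <letter(s)>'."
        )
    if strategy == "source_grounded":
        # Supply reference text and restrict the model to it.
        if not source_excerpt:
            raise ValueError("source_grounded prompts need a source_excerpt")
        return (
            f"Reference text:\n{source_excerpt}\n\n{question}\n\n"
            "Answer using only the reference text. "
            "Give the letter(s) of the correct option(s)."
        )
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical sample question, not taken from the study's question set.
stem = "Who is accountable for ordering the Product Backlog?"
options = {"A": "The Scrum Master", "B": "The Product Owner", "C": "The Developers"}
excerpt = "<paste the relevant Scrum Guide passage here>"

for name in ("zero_shot", "chain_of_thought", "source_grounded"):
    print(f"--- {name} ---")
    print(build_prompt(name, stem, options, source_excerpt=excerpt))
```

Keeping the question rendering separate from the strategy wrapper means the same question set can be run under each strategy without changing the question text, which makes accuracy comparisons between strategies easier to interpret.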

Who benefits

Software Development, IT Services, Consulting, EdTech, Project Management

Key takeaways

  • Gemini 3 Flash outperformed GPT-5 mini and DeepSeek Chat 3.2 on Scrum questions.
  • LLMs are more accurate on single-choice questions and explicit Scrum topics.
  • Multi-select and interpretive topics like Scrum Values are more error-prone.
  • Systematic error patterns include overgeneralization and conflicts with strict definitions.

Original post by Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio, Angelo Perkusich

"arXiv:2607.00048v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in exam- and certification-style question answering tasks, where their ability to retrieve, interpret, and apply domain-specific knowledge can be systematically assessed. In Softw…"

