LLMs Evaluated on Scrum Certification Questions: Gemini 3 Flash Leads

Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio, Angelo Perkusich · July 2, 2026

Summary

This paper compares GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2 on 993 Scrum certification-style questions under three prompting strategies: zero-shot, chain-of-thought, and source-grounded. Gemini 3 Flash achieved the highest accuracy. The study also breaks down performance by topic and question format and identifies systematic error patterns.

Large Language Models (LLMs) are increasingly being used for exam preparation and knowledge assessment in specialized domains. This research evaluates the performance of three prominent LLMs—GPT-5 mini, Gemini 3 Flash, and DeepSeek Chat 3.2—on a comprehensive set of 993 Scrum certification-style questions, specifically aligned with the Professional Scrum Master I (PSM I) assessment. The models were tested using three prompting strategies: zero-shot, chain-of-thought, and source-grounded, with repeated executions to gauge stability. Results indicated clear performance differences, with Gemini 3 Flash demonstrating the highest accuracy, followed by GPT-5 mini and DeepSeek Chat 3.2, while intra-model variability remained low. Analysis revealed that models performed best on single-answer multiple-choice questions and struggled more with multi-select and True/False formats. Performance was stronger in normatively explicit Scrum areas like Artifacts and Empiricism, but weaker in more interpretive topics such as Scrum Values and Self-Managing Teams. Qualitative analysis uncovered systematic error patterns, including overgeneralization and conflicts with strict Scrum definitions.
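The evaluation described above (accuracy by question format, plus stability across repeated runs) can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: the record layout, the field names, and the exact-match rule for multi-select questions are all assumptions, since the summary does not say how partially correct answers were scored.

```python
from collections import defaultdict
from statistics import mean, pstdev

def is_correct(predicted, expected):
    """Exact-match scoring (an assumption): the chosen options must equal the key."""
    return set(predicted) == set(expected)

def accuracy_by_group(records, key):
    """Return {group: accuracy}, grouping records by the value of records[i][key]."""
    flags = defaultdict(list)
    for record in records:
        flags[record[key]].append(is_correct(record["predicted"], record["expected"]))
    return {group: sum(hits) / len(hits) for group, hits in flags.items()}

def run_stability(records):
    """Return (mean, population std dev) of overall accuracy across repeated runs."""
    per_run = list(accuracy_by_group(records, "run").values())
    return mean(per_run), pstdev(per_run)

# Hypothetical records: two runs over one single-answer and one multi-select question.
records = [
    {"run": 1, "format": "single", "predicted": ["B"], "expected": ["B"]},
    {"run": 1, "format": "multi", "predicted": ["A", "C"], "expected": ["A", "C", "D"]},
    {"run": 2, "format": "single", "predicted": ["B"], "expected": ["B"]},
    {"run": 2, "format": "multi", "predicted": ["C", "A", "D"], "expected": ["A", "C", "D"]},
]

by_format = accuracy_by_group(records, "format")
avg_accuracy, spread = run_stability(records)
```

Grouping by a `topic` field instead of `format` would give the per-topic breakdown in the same way. A low spread across runs corresponds to the low intra-model variability the study reports.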

Why it matters

Professionals relying on LLMs for learning, coaching, or certification preparation in Agile methodologies need to understand their capabilities and limitations. This study provides critical insights into which models perform best and where their knowledge gaps lie regarding normative frameworks like Scrum.

How to implement this in your domain

1. Use Gemini 3 Flash or GPT-5 mini for Scrum-related queries, especially for factual recall on artifacts and empiricism.
2. Employ chain-of-thought or source-grounded prompting for improved accuracy when seeking Scrum advice from LLMs.
3. Supplement LLM-generated answers with human expert review, particularly for multi-select questions or interpretive Scrum values.
4. Develop internal guidelines for LLM usage in Agile training, highlighting areas where models are less reliable.
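For readers who want to try the three prompting strategies named in the study, the sketch below shows one way to assemble each prompt for a multiple-choice question. The function name `build_prompt`, its parameters, and all instruction wording are hypothetical; the paper's actual prompts are not reproduced in this summary, and no model API is called here.

```python
def build_prompt(question, options, strategy, reference_text=None):
    """Assemble a prompt for one multiple-choice question (wording is illustrative)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    option_lines = [f"{letters[i]}. {text}" for i, text in enumerate(options)]
    parts = []
    if strategy == "source-grounded":
        # Source-grounded: the model is asked to rely on a supplied reference excerpt.
        if not reference_text:
            raise ValueError("source-grounded prompting needs reference_text")
        parts.append("Answer using only the reference text below.")
        parts.append("Reference:\n" + reference_text)
    elif strategy not in ("zero-shot", "chain-of-thought"):
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append("Question: " + question)
    parts.append("\n".join(option_lines))
    if strategy == "chain-of-thought":
        # Chain-of-thought: ask for explicit reasoning before the final answer.
        parts.append("Reason step by step, then give the final answer.")
    parts.append("Finish with a line of the form 'Answer: <letter(s)>'.")
    return "\n\n".join(parts)

# Hypothetical example question and options.
question = "Who is accountable for ordering the Product Backlog?"
options = ["The Scrum Master", "The Product Owner", "The Developers"]
zero_shot = build_prompt(question, options, "zero-shot")
chain_of_thought = build_prompt(question, options, "chain-of-thought")
```

Asking for a fixed final line such as `Answer: B` makes responses easier to score automatically, which matters most for multi-select questions, where the study found models to be more error-prone.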

Who benefits

Software Development, IT Services, Consulting, EdTech, Project Management

Key takeaways

Gemini 3 Flash outperformed GPT-5 mini and DeepSeek Chat 3.2 on Scrum questions.
LLMs are more accurate on single-choice questions and explicit Scrum topics.
Multi-select and interpretive topics like Scrum Values are more error-prone.
Systematic error patterns include overgeneralization and conflicts with strict definitions.

Original post by Robson Alves Vilar, Emanuel Dantas Filho, Ademar França de Sousa Neto, Mirko Perkusich, Danyllo Wagner Albuquerque, João Paiva, Kyller Gorgônio, Angelo Perkusich

"arXiv:2607.00048v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in exam- and certification-style question answering tasks, where their ability to retrieve, interpret, and apply domain-specific knowledge can be systematically assessed. In Softw…"
