AI Coaches Struggle with Explanations and Visual Grounding in Software Training

Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, Amy Pavel · July 2, 2026

Summary

A new multimodal dataset, DigitalCoach, shows that state-of-the-art AI models coaching humans on computer use give more direct instructions than human experts but fewer explanations, error diagnoses, and knowledge checks. The models also ground their coaching poorly in what is on screen, and learners tend to follow their instructions passively.

As AI agents become more adept at automating software tasks, researchers are asking whether they can also teach humans to use software. The authors introduce DigitalCoach, a multimodal dataset of 72 human expert-novice coaching sessions, totaling over 28 hours of screen and input recordings across five software applications, and use it to evaluate current AI models as computer use coaches. The coaching styles differ: AI models give more direct instructions than human coaches but explain, diagnose errors, and check for understanding less often. When the coaching method is fixed, the models can generate human-like utterances, yet those utterances are poorly grounded in the visual context of the screen. Interactive evaluations confirmed that learners coached by AI agents often follow instructions passively without deep engagement, which points to gaps in both visual understanding and pedagogical approach.

Why it matters

Professionals developing AI-powered educational tools or internal training systems need to understand these limitations to design more effective and engaging learning experiences that go beyond mere instruction.

How to implement this in your domain

1. Integrate multimodal input (screen recordings, user actions) into AI coaching systems to improve visual grounding.
2. Develop AI models with explicit objectives for explanation generation, error diagnosis, and knowledge assessment.
3. Design interactive learning modules that encourage active engagement rather than passive instruction following.
4. Conduct user studies to compare AI-led coaching effectiveness against human-led coaching for specific software tasks.
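
As a starting point for the first two steps, the sketch below shows one way to log coaching turns together with their screen context and to flag sessions that lean too heavily on direct instruction. It is a minimal illustration, not the paper's method: the `CoachingTurn` record, the four act labels, the file paths, and the `min_share` threshold are all hypothetical choices made for this example, and the act label on each turn is assumed to come from a human annotator or a separate classifier.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical coaching-act labels, loosely following the behaviours named in the summary.
ACTS = ("instruction", "explanation", "error_diagnosis", "knowledge_check")


@dataclass
class CoachingTurn:
    """One coach utterance plus the context it should be grounded in."""
    utterance: str
    act: str                     # one of ACTS, assigned by an annotator or classifier
    screenshot_path: str         # hypothetical path to the screen frame for this turn
    recent_actions: List[str] = field(default_factory=list)  # e.g. "click:Insert menu"


def act_distribution(turns: List[CoachingTurn]) -> Dict[str, float]:
    """Return the share of turns that fall under each coaching act."""
    counts = Counter(turn.act for turn in turns)
    total = len(turns)
    return {act: (counts[act] / total if total else 0.0) for act in ACTS}


def underused_acts(turns: List[CoachingTurn], min_share: float = 0.1) -> List[str]:
    """List non-instruction acts whose share is below min_share (an arbitrary threshold)."""
    dist = act_distribution(turns)
    return [act for act in ACTS if act != "instruction" and dist[act] < min_share]


# Made-up session: three direct instructions and a single explanation.
session = [
    CoachingTurn("Click the Insert menu.", "instruction", "frames/0001.png"),
    CoachingTurn("Choose Pivot Table.", "instruction", "frames/0002.png",
                 ["click:Insert menu"]),
    CoachingTurn("A pivot table groups rows so you can summarize them.",
                 "explanation", "frames/0003.png"),
    CoachingTurn("Drag Region into Rows.", "instruction", "frames/0004.png"),
]

print(act_distribution(session))
print(underused_acts(session))
```

For this made-up session, the report flags `error_diagnosis` and `knowledge_check` as underused, which is the kind of signal a team could track while comparing an AI coach against human coaches on the same tasks.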

Who benefits

EdTech, Corporate Training, Software Development, Customer Support

Key takeaways

Current AI coaches prioritize direct instructions over explanations and error diagnosis.
AI models struggle with visual grounding in real-time computer use coaching.
Learners tend to be passive when coached by current AI systems.
Future AI coaching agents need improved pedagogical strategies and multimodal understanding.

Original post by Meng Chen, Anya Ji, Tsung-Han Wu, Tobias Maringgele, David M. Chan, Alane Suhr, Amy Pavel

"arXiv:2606.31980v1 Announce Type: cross Abstract: Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consi…"

