Quality-Aware Self-Distillation Improves GUI Grounding in VLMs

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu · June 17, 2026

Summary

A new quality-aware self-distillation method enhances vision-language models (VLMs) for GUI grounding by improving the reliability of coordinate-token teacher signals. It uses soft correctness-aware gating and teacher-probability scaling to mitigate signal degradation when student-generated prefixes deviate from target coordinates.

This research introduces a quality-aware self-distillation technique for improving vision-language models (VLMs) on graphical user interface (GUI) grounding. GUI grounding requires a VLM to locate small target elements in high-resolution screenshots and predict their screen coordinates precisely. Existing on-policy self-distillation (OPSD) methods have a weakness on this task: when the prefix generated by the student model deviates from the correct target coordinate, the quality of the teacher signal conditioned on that prefix can degrade, and the supervision becomes unreliable.

The proposed method addresses this with two complementary mechanisms, soft correctness-aware gating and teacher-probability scaling. The correctness-aware gate checks whether the teacher's current coordinate-token prediction can still lead to the ground-truth box given the student's prefix, and down-weights signals that cannot. Teacher-probability scaling then uses the teacher's confidence to calibrate the strength of the remaining signals. Experiments on six GUI grounding benchmarks showed consistent improvements, indicating that the two mechanisms work together to suppress unreliable supervision and calibrate signal strength.
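To make the two mechanisms concrete, here is a minimal pure-Python sketch of how a per-token distillation weight might be computed. It is not the paper's formulation; several details are assumptions made for illustration: coordinates are emitted as fixed-width digit strings, the gate falls back to a constant `floor` instead of zero when the teacher's token cannot reach the ground-truth range, the teacher's top-1 probability stands in for its confidence, and the per-token loss is a KL divergence. All function names are hypothetical.

```python
import math


def reachable(prefix: str, digit: str, lo: int, hi: int, width: int) -> bool:
    """Return True if prefix + digit can still be completed to a value in [lo, hi].

    Assumes a coordinate is written as exactly `width` decimal digits.
    """
    partial = prefix + digit
    remaining = width - len(partial)
    if remaining < 0:
        return False
    # Smallest and largest values obtainable by filling the remaining digits.
    low = int(partial) * 10 ** remaining
    high = low + 10 ** remaining - 1
    return low <= hi and high >= lo


def token_weight(prefix, teacher_probs, lo, hi, width, floor=0.1):
    """Soft gate times teacher confidence (both choices are assumptions).

    teacher_probs maps each candidate digit to the teacher's probability.
    """
    top_digit = max(teacher_probs, key=teacher_probs.get)
    # Soft gate: keep full weight if the teacher's token can still reach the
    # ground-truth range, otherwise shrink the weight to `floor`.
    gate = 1.0 if reachable(prefix, top_digit, lo, hi, width) else floor
    # Probability scaling: a confident teacher contributes a stronger signal.
    return gate * teacher_probs[top_digit]


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the teacher's support."""
    return sum(
        p * math.log((p + eps) / (student_probs.get(tok, 0.0) + eps))
        for tok, p in teacher_probs.items()
        if p > 0
    )


def weighted_distill_loss(prefix, teacher_probs, student_probs, lo, hi, width, floor=0.1):
    """Per-token distillation loss, scaled by the quality-aware weight."""
    weight = token_weight(prefix, teacher_probs, lo, hi, width, floor)
    return weight * kl_divergence(teacher_probs, student_probs)
```

For example, with a ground-truth x-range of 415 to 430 and three-digit coordinates, a student prefix of `"4"` followed by a teacher prediction of `"2"` can still land in the box (420 to 429), so the signal keeps its full gate. A student prefix of `"5"` cannot reach the range whatever digit follows, so the signal is down-weighted to the floor.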

Why it matters

For professionals developing AI for UI automation, accessibility, or human-computer interaction, this advancement offers a more robust method for training VLMs to accurately understand and interact with graphical interfaces. Improved GUI grounding can lead to more reliable automated testing, more intuitive assistive technologies, and more precise AI agents for user interaction.

How to implement this in your domain

  1. Apply quality-aware self-distillation techniques when training VLMs for GUI grounding tasks.
  2. Implement soft correctness-aware gating to filter out unreliable teacher signals during self-distillation.
  3. Incorporate teacher-probability scaling to calibrate the strength of supervision based on teacher confidence.
  4. Evaluate the combined effect of these mechanisms on VLM performance across various GUI benchmarks.
  5. Integrate improved GUI grounding models into applications requiring precise UI element identification.
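For the evaluation step, a common way to score GUI grounding is to count a prediction as correct when the predicted point falls inside the ground-truth bounding box. The sketch below assumes that metric and an `(x1, y1, x2, y2)` box layout; whether the paper's benchmarks use exactly this scoring is an assumption, and the function names are hypothetical.

```python
def point_in_box(x, y, box):
    """Return True if (x, y) lies inside box = (x1, y1, x2, y2), edges included."""
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(predictions, boxes):
    """Fraction of predicted points that land inside their ground-truth box.

    predictions: list of (x, y) points, one per example.
    boxes: list of (x1, y1, x2, y2) ground-truth boxes, in the same order.
    """
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(x, y, box) for (x, y), box in zip(predictions, boxes))
    return hits / len(predictions)
```

Running the same function on a baseline model and on a model trained with each mechanism enabled in turn gives a simple ablation of their individual and combined effects.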

Who benefits

Software Development, UI/UX Design, Accessibility Tech, Robotics, Gaming

Key takeaways

  • GUI grounding requires VLMs to identify precise screen coordinates of UI elements.
  • Naive self-distillation can suffer from unreliable teacher signals when student predictions deviate.
  • Quality-aware self-distillation uses correctness-aware gating and probability scaling to improve signal quality.
  • This method consistently enhances VLM performance on GUI grounding benchmarks.

Original post by Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

"arXiv:2606.18101v1 Announce Type: new Abstract: Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promisi…"
