RL for GUI Agents Enhanced by Autonomous Evaluation

Marta Sumyk, Oleksandr Kosovan · June 24, 2026

Summary

This paper introduces a reinforcement learning framework for computer-use agents that leverages autonomous vision-language evaluation as a scalable reward signal. By modeling evaluator feedback as noisy binary rewards and applying a noise-corrected estimator, the framework significantly improves agent success rates across various desktop environments.

Training Computer-Use Agents (CUAs) with reinforcement learning (RL) in open-ended desktop environments is challenging because scalable, machine-readable reward signals are scarce. Task success in graphical user interfaces (GUIs) is often determined visually and is difficult to specify with manual labels or handcrafted reward functions. The authors propose an RL fine-tuning framework that uses autonomous vision-language evaluation as a scalable supervision signal for GUI agents. A Vision-Language Model (VLM) assesses task completion from the final screenshot and the original instruction, providing terminal feedback without task-specific heuristics or manual labeling during policy optimization. Because autonomous evaluators are imperfect, the authors model evaluator feedback as a noisy binary reward channel and derive a noise-corrected reward estimator for Proximal Policy Optimization (PPO).

Experiments on the macOSWorld, Windows Agent Arena, and OSWorld benchmarks show that the corrected evaluator rewards outperform both zero-shot baselines and fine-tuning on raw evaluator rewards, with an average improvement of 12.6 percentage points over zero-shot performance and 5.1 points over raw evaluator fine-tuning. This suggests that autonomous evaluation, when its noise is explicitly modeled, can serve as a practical and effective reward signal for RL in GUI environments.
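This summary does not reproduce the paper's estimator, so the following is only a minimal sketch of one standard correction for a binary reward observed through a noisy channel. It assumes the evaluator's false-positive and false-negative rates are known and that they sum to less than 1. The names `corrected_reward`, `noisy_verdict`, `simulate`, `fp_rate`, and `fn_rate` are illustrative and not taken from the paper, and the simulated evaluator is a stand-in for a real VLM.

```python
import random


def corrected_reward(observed: int, fp_rate: float, fn_rate: float) -> float:
    """Unbiased estimate of the true binary reward from a noisy verdict.

    Assumption (not from the paper): the evaluator reports success on a
    failed episode with probability fp_rate, and failure on a successful
    episode with probability fn_rate.
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0.0:
        raise ValueError("need fp_rate + fn_rate < 1 (evaluator better than chance)")
    return (observed - fp_rate) / denom


def noisy_verdict(true_success: bool, fp_rate: float, fn_rate: float,
                  rng: random.Random) -> int:
    """Simulate the noisy channel; a hypothetical stand-in for a VLM evaluator."""
    if true_success:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


def simulate(true_rate: float, fp_rate: float, fn_rate: float,
             n: int = 50_000, seed: int = 0) -> tuple:
    """Return (mean raw reward, mean corrected reward) over n simulated episodes."""
    rng = random.Random(seed)
    raw_sum = 0.0
    corrected_sum = 0.0
    for _ in range(n):
        success = rng.random() < true_rate
        verdict = noisy_verdict(success, fp_rate, fn_rate, rng)
        raw_sum += verdict
        corrected_sum += corrected_reward(verdict, fp_rate, fn_rate)
    return raw_sum / n, corrected_sum / n


if __name__ == "__main__":
    raw_mean, corrected_mean = simulate(true_rate=0.3, fp_rate=0.2, fn_rate=0.1)
    print(f"raw mean reward:       {raw_mean:.3f}")
    print(f"corrected mean reward: {corrected_mean:.3f}")
```

In this simulation the raw mean is biased away from the true success rate of 0.3, while the corrected mean stays close to it. The cost is that individual corrected rewards fall outside [0, 1] and have higher variance, which is one reason such a correction would typically be paired with a variance-aware optimizer setup.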

Why it matters

Professionals developing autonomous agents for desktop automation can adopt autonomous vision-language evaluation as a scalable reward mechanism and, based on the reported results, may see higher agent success rates, especially in complex GUI environments where manual reward engineering is impractical.

How to implement this in your domain

  1. Integrate VLM evaluators: Employ Vision-Language Models to autonomously assess task completion from screenshots and instructions for GUI automation agents.
  2. Apply noise correction: Implement noise-corrected reward estimators in RL frameworks to account for imperfections in autonomous evaluators, improving learning stability and performance.
  3. Fine-tune with autonomous rewards: Use the proposed RL fine-tuning framework to train computer-use agents with scalable, automatically generated reward signals.
  4. Benchmark across platforms: Test and validate the performance of RL-trained GUI agents across diverse operating system environments such as macOS, Windows, and Linux.
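The steps above do not say how an evaluator's error rates would be obtained. One possible approach, assumed here and not described in the source, is to compare evaluator verdicts against known outcomes on a small calibration set before training. The sketch below uses hypothetical names (`estimate_noise_rates`, `calibration`) and hand-written data purely for illustration.

```python
from typing import Iterable, Tuple


def estimate_noise_rates(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Estimate (fp_rate, fn_rate) from (true_success, evaluator_verdict) pairs.

    Assumption (not from the paper): a small set of episodes with known
    outcomes is available for calibration.
    """
    false_pos = false_neg = positives = negatives = 0
    for truth, verdict in pairs:
        if truth:
            positives += 1
            false_neg += int(not verdict)  # evaluator missed a real success
        else:
            negatives += 1
            false_pos += int(verdict)      # evaluator accepted a failure
    if positives == 0 or negatives == 0:
        raise ValueError("calibration set needs both successful and failed episodes")
    return false_pos / negatives, false_neg / positives


if __name__ == "__main__":
    # Illustrative data only: 4 truly successful and 5 truly failed episodes.
    calibration = [
        (True, True), (True, True), (True, True), (True, False),
        (False, False), (False, False), (False, False), (False, False),
        (False, True),
    ]
    fp_rate, fn_rate = estimate_noise_rates(calibration)
    print(f"fp_rate={fp_rate:.2f}, fn_rate={fn_rate:.2f}")
```

Rates estimated this way could then parameterize whatever noise-corrected estimator is used in step 2. With so few calibration episodes the estimates would be rough, so a real setup would likely need a larger set per platform.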

Who benefits

Software Development, IT Automation, Robotics, Business Process Automation

Key takeaways

  • Autonomous vision-language evaluation provides scalable reward signals for RL in GUI environments.
  • Modeling evaluator feedback as noisy binary rewards is crucial for effective learning.
  • Noise-corrected reward estimators significantly improve GUI agent success rates.
  • This approach reduces reliance on handcrafted rewards or dense manual labels for RL.

Original post by Marta Sumyk, Oleksandr Kosovan

"arXiv:2606.24515v1 Announce Type: new Abstract: Computer-Use Agents (CUAs) execute high-level user goals by perceiving and acting directly within graphical user interfaces. However, reinforcement learning for CUAs remains difficult because open-ended desktop environments rarely p…"