New Dataset Boosts Computer-Use Agent Training Performance

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong · June 17, 2026

Summary

A new dataset, ProCUA-SFT, comprising 3.1 million step-level samples, has been created to improve the supervised fine-tuning of computer-use agents. The dataset is distilled from synthetic trajectories across thousands of application combinations. Models fine-tuned on it score markedly higher on benchmarks such as OSWorld than models fine-tuned on previous large-scale datasets.

Training computer-use agents (CUAs), models that operate graphical interfaces, requires extensive and varied data from real desktop environments. Existing large public datasets, such as AgentNet, have shown negative transfer when used for supervised fine-tuning: agent success rates decline after training on them. To address this, researchers developed ProCUA-SFT, a new dataset containing 3.1 million step-level samples. The data is derived from 93,000 synthetic trajectories generated across 2,484 different application configurations by a fully automated pipeline. The pipeline synthesizes grounded tasks on live desktops with real-world content and verifies that each task is feasible before executing it. A single Vision-Language Model (VLM) serves as goal generator, precondition judge, and trajectory executor, which keeps the three stages consistent. Fine-tuning models such as UI-TARS 7B on ProCUA-SFT yielded an 18.7 percentage-point improvement on OSWorld, significantly surpassing the performance achieved with AgentNet. A portion of ProCUA also contributed to the capabilities of the Nemotron 3 Nano Omni model.
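The flow described above (propose a goal, check its precondition, execute, then split the trajectory into step-level samples) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: StubModel, its three methods, and the StepSample fields are hypothetical names, the model returns canned output instead of calling a real VLM, and the sample layout is a guess rather than the actual ProCUA-SFT schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    observation: str  # stand-in for a screenshot
    action: str       # stand-in for a keyboard/mouse action


@dataclass
class StepSample:
    goal: str
    history: List[str]  # actions taken before this step
    observation: str
    action: str         # supervised target for this step


class StubModel:
    """Hypothetical stand-in for the single VLM; returns canned outputs."""

    def propose_goal(self, desktop_state: str) -> str:
        return f"Rename the file shown in: {desktop_state}"

    def precondition_holds(self, goal: str, desktop_state: str) -> bool:
        # Toy feasibility check: a file must be visible on the desktop.
        return "file" in desktop_state

    def execute(self, goal: str, desktop_state: str) -> List[Step]:
        return [
            Step(desktop_state, "right_click(file)"),
            Step("context menu open", "click(Rename)"),
            Step("rename field focused", "type(report.txt)"),
        ]


def synthesize(model: StubModel, desktop_state: str) -> List[StepSample]:
    """Run one model through all three roles and emit step-level samples."""
    goal = model.propose_goal(desktop_state)
    if not model.precondition_holds(goal, desktop_state):
        return []  # infeasible tasks are dropped before execution
    samples: List[StepSample] = []
    history: List[str] = []
    for step in model.execute(goal, desktop_state):
        # Each step becomes one sample: context so far -> next action.
        samples.append(StepSample(goal, list(history), step.observation, step.action))
        history.append(step.action)
    return samples


if __name__ == "__main__":
    for sample in synthesize(StubModel(), "file manager showing notes.txt"):
        print(len(sample.history), sample.action)
```

In this sketch, one three-step trajectory yields three training samples, which is the sense in which a corpus of trajectories expands into a larger number of step-level samples.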

Why it matters

This development offers a superior dataset and methodology for training AI agents capable of interacting with desktop environments, which is crucial for automating complex workflows and improving human-computer interaction.

How to implement this in your domain

  1. Evaluate ProCUA-SFT for fine-tuning custom computer-use agents in enterprise automation scenarios.
  2. Adopt the automated data synthesis pipeline for generating task-specific training data for UI automation.
  3. Benchmark existing computer-use agents against models trained with ProCUA-SFT to identify performance gaps.
  4. Explore integrating VLM-driven task generation and verification for robust agent development.
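For the benchmarking step, gains are usually reported in percentage points, meaning the absolute difference between two success rates. A minimal sketch of that comparison follows; the function names and the per-task outcomes are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Percentage of tasks completed successfully."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return 100.0 * sum(outcomes) / len(outcomes)


def point_gain(baseline: Sequence[bool], candidate: Sequence[bool]) -> float:
    """Absolute gain in percentage points (candidate minus baseline)."""
    return success_rate(candidate) - success_rate(baseline)


if __name__ == "__main__":
    # Toy per-task outcomes on the same four tasks (hypothetical data).
    baseline = [True, False, False, False]
    candidate = [True, True, False, False]
    print(point_gain(baseline, candidate))
```

A percentage-point gain is not the same as a relative improvement: moving from 25% to 50% is a 25-point gain but a 100% relative increase, so it helps to state which one a comparison uses.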

Who benefits

  • Software Development
  • IT Automation
  • Business Process Automation
  • Robotics
  • AI Development

Key takeaways

  • Large-scale, diverse data is critical for training effective computer-use agents.
  • Existing datasets can lead to negative transfer during supervised fine-tuning.
  • ProCUA-SFT is a new, high-quality synthetic dataset that significantly improves agent performance.
  • Automated data generation pipelines using VLMs can create robust training data.

Original post by Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

"arXiv:2606.17321v1 Announce Type: new Abstract: Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest…"
