OSWorld 2.0 Benchmarks AI Agents on Complex Real-World Tasks

Mengqi Yuan, Zilong Zhou, Xinzhuang Xiong, Weiming Wu, Jiayang Sun, Jiamin Song, Kaiqian Cui, Bowen Wang, Haoyuan Wu, Yitong Li, Dunjie Lu, Haikong Lu, Qi Zhen, Xinyuan Wang, Jiaqi Deng, Yuhao Yang, Cheng Chen, Boyuan Zheng, Alex Su, Xiao Yu, Hao Zou, Saaket Agashe, Xing Han Lu, Manpreet Kaur, Zhengyang Qi, Vincent Sunn Chen, Frederic Sala, Dayiheng Liu, Junyang Lin, Zhou Yu, Yu Su, Siva Reddy, Xin Eric Wang, Peng Qi, Tianbao Xie, Tao Yu · June 30, 2026

Summary

Researchers introduce OSWorld 2.0, a new benchmark featuring 108 long-horizon, real-world computer-use workflows designed to expose limitations of frontier AI agents. The benchmark reveals that current agents struggle with complex phenomena like cross-source reasoning, implicit-state inference, and dynamic environments, completing only a small fraction of tasks.

Existing benchmarks for computer-use agents often fail to capture the realism, complexity, and extended duration of real-world tasks, which limits how well they expose the shortcomings of even the most advanced agents. OSWorld 2.0 addresses this with a benchmark of 108 long-horizon computer-use workflows. The tasks mirror everyday and professional scenarios, and human users typically take over an hour and a half to complete them. The benchmark targets phenomena that are common in real workflows but underrepresented in previous evaluations: streaming interactions, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision.

In evaluations on OSWorld 2.0, even frontier models such as Claude Opus 4.8, run with maximum thinking and batched tool calls, complete only about 20.6% of tasks. GPT-3.5, while more token-efficient, scores lower still. The results indicate that current agents are far from professional-level computer use. They often fail by losing track of constraints, missing information that arrives mid-task, guessing instead of asking, and skipping verification, particularly when a task depends on hidden state that the agent must recover.

Why it matters

This benchmark provides a more realistic and challenging evaluation for AI agents, helping developers identify critical weaknesses and drive advancements towards truly capable general-purpose computer-use AI.

How to implement this in your domain

  1. Utilize OSWorld 2.0 as a standard benchmark for evaluating the capabilities of new AI agents designed for computer automation.
  2. Focus AI agent development efforts on improving cross-source reasoning, implicit-state inference, and handling dynamic environments.
  3. Design agent architectures that prioritize robust state tracking, verification steps, and user clarification mechanisms for complex tasks.
  4. Integrate long-horizon, multi-step task completion as a key performance indicator for AI automation projects.
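
Steps 3 and 4 above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from OSWorld 2.0 or the paper: `TaskState`, `Step`, `run_task`, `ask_user`, and `completion_rate` are hypothetical names chosen for this example. It assumes each step can be expressed as an action plus a check on the resulting observation, and that a callback is available for asking the user a question.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaskState:
    """Explicit record of constraints, finished steps, and unanswered questions."""
    constraints: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


@dataclass
class Step:
    """One unit of work (hypothetical structure for this sketch)."""
    name: str
    act: Callable[[], object]            # performs the action, returns an observation
    verify: Callable[[object], bool]     # checks the observation before moving on
    needs_info: Optional[str] = None     # question to ask when required info is missing


def run_task(
    steps: list[Step],
    state: TaskState,
    ask_user: Callable[[str], Optional[str]],
    max_retries: int = 1,
) -> bool:
    """Run steps in order; return True only if every step passes verification."""
    for step in steps:
        if step.needs_info is not None:
            # Ask instead of guessing, and record the answer as a constraint.
            answer = ask_user(step.needs_info)
            if answer is None:
                state.open_questions.append(step.needs_info)
                return False
            state.constraints.append(f"{step.needs_info} -> {answer}")

        for _ in range(max_retries + 1):
            observation = step.act()
            if step.verify(observation):
                state.completed_steps.append(step.name)
                break
        else:
            # Every attempt failed verification: stop rather than continue blindly.
            return False
    return True


def completion_rate(results: list[bool]) -> float:
    """Fraction of tasks completed end to end (the KPI from step 4)."""
    return sum(results) / len(results) if results else 0.0
```

Keeping constraints and open questions in an explicit `TaskState`, rather than only in the model's context, is one possible way to address the "losing track of constraints" failure described in the summary. A real agent would need far richer verification than a single boolean check.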

Who benefits

Software Development, IT Automation, Business Process Automation, Robotics, Customer Service

Key takeaways

  • Existing computer-use benchmarks are insufficient for evaluating frontier AI agents on real-world complexity.
  • OSWorld 2.0 introduces 108 long-horizon tasks, revealing significant limitations in current AI agents.
  • Agents struggle with dynamic environments, cross-source reasoning, and implicit state inference.
  • Current AI agents are far from professional-level computer use, often failing on complex constraints and verification.

"arXiv:2606.29537v1 Announce Type: new Abstract: Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmar…"
