PPT-Eval: New Benchmark for AI Agents on PowerPoint Tasks

Apurva Gandhi, Vishwas Suryanarayanan, Raja Hasnain Anwar, Firoz Shaik, Shubhang Desai, Thong Q. Nguyen, Muhammad Taqi Raza, Vishal Chowdhary, Graham Neubig · July 1, 2026

Summary

PPT-Eval is a new benchmark featuring 120 PowerPoint tasks across 12 files, designed to evaluate computer-use agents in content creation and editing. It introduces a robust rubric-based evaluation framework that awards partial credit and provides natural language feedback.

Creating and editing presentations is a ubiquitous, complex activity in professional settings, which makes it a natural testbed for real-world computer-use agents. PPT-Eval addresses the lack of standardized evaluation in this domain. The benchmark comprises 120 PowerPoint tasks spread across 12 files, covers both content creation and editing, and organizes tasks by difficulty level.

Its central contribution is the evaluation framework, which targets complex, multimodal tasks that often have more than one valid solution. Drawing on prior work, the framework uses task-specific rubrics to award partial credit for intermediate steps, penalize unnecessary changes or poor aesthetics, and return detailed natural language feedback. The resulting scores are reported to correlate highly with human judgments. Initial evaluations of frontier agents such as Claude-4.5-Opus show that current models still struggle, reaching only a 45% success rate and an average partial score of 57%, which leaves substantial room for improvement in real-world office automation.
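To make the rubric idea concrete, here is a minimal sketch of how partial credit, penalties, and feedback could be combined for a single task. It is not the paper's scoring formula: `RubricItem`, `score_task`, the weights, and the clamping to [0, 1] are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One hypothetical rubric line for a task (names are illustrative only)."""
    description: str  # what the grader checks
    weight: float     # positive = credit for a step, negative = penalty
    triggered: bool   # whether the grader judged the criterion to apply


def score_task(items: list[RubricItem]) -> tuple[float, list[str]]:
    """Return a partial score in [0, 1] plus one feedback line per rubric item."""
    # Maximum achievable credit: every positive item met, no penalty triggered.
    max_credit = sum(i.weight for i in items if i.weight > 0)
    if max_credit <= 0:
        raise ValueError("rubric needs at least one positively weighted item")

    # Earned credit: met steps add their weight, triggered penalties subtract.
    earned = sum(i.weight for i in items if i.triggered)
    score = min(1.0, max(0.0, earned / max_credit))

    feedback = []
    for i in items:
        if i.weight > 0:
            status = "met" if i.triggered else "missed"
        else:
            status = "penalty applied" if i.triggered else "no penalty"
        feedback.append(f"[{status}] {i.description}")
    return score, feedback


if __name__ == "__main__":
    rubric = [
        RubricItem("Title slide added", 2.0, True),
        RubricItem("Chart inserted on slide 3", 1.0, False),
        RubricItem("Font made consistent", 1.0, True),
        RubricItem("Unrelated slide was modified", -0.5, True),
    ]
    score, notes = score_task(rubric)
    print(score)
    print("\n".join(notes))
```

In this sketch a penalty item lowers the score without changing the maximum achievable credit, so an agent that completes every step but also makes unnecessary changes still falls short of a perfect score.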

Why it matters

This benchmark provides a standardized and nuanced way for professionals to assess and improve the capabilities of AI agents in common, complex office automation tasks, driving progress in real-world productivity tools.

How to implement this in your domain

  1. Use PPT-Eval to benchmark your organization's AI automation tools on complex, multimodal tasks.
  2. Integrate rubric-based evaluation principles into your internal testing frameworks for AI agents handling office productivity tasks.
  3. Identify the specific areas where current AI agents struggle in presentation creation and editing to guide future development efforts.
  4. Review the benchmark's findings to understand the current limitations of frontier AI models in real-world computer-use scenarios.
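When adapting these steps to an internal test suite, per-task scores can be rolled up into the two headline numbers the summary reports: a success rate and an average partial score. The sketch below assumes each task already has a score in [0, 1] and that a task counts as a success only at a full score. The function name `summarize` and the `success_threshold` parameter are hypothetical; the benchmark's own success criterion may differ.

```python
from statistics import mean


def summarize(scores: list[float], success_threshold: float = 1.0) -> dict[str, float]:
    """Aggregate per-task scores into a success rate and a mean partial score.

    Assumption: a task is a "success" when its score reaches success_threshold.
    """
    if not scores:
        raise ValueError("no task scores to summarize")
    successes = sum(1 for s in scores if s >= success_threshold)
    return {
        "success_rate": successes / len(scores),
        "avg_partial_score": mean(scores),
    }


if __name__ == "__main__":
    # Four hypothetical task scores, not benchmark data.
    print(summarize([1.0, 0.5, 1.0, 0.0]))
```

Reporting both numbers is useful because an agent can complete many intermediate steps, which raises its average partial score, while still failing most tasks outright.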

Who benefits

Software Development, Business Services, Education, Consulting, Marketing

Key takeaways

  • PPT-Eval is a new benchmark for evaluating AI agents on complex PowerPoint tasks.
  • It features 120 tasks across 12 files, covering content creation and editing.
  • A robust rubric-based evaluation framework provides partial credit and natural language feedback.
  • Current frontier agents still struggle, indicating significant room for improvement in office automation AI.

Original post by Apurva Gandhi, Vishwas Suryanarayanan, Raja Hasnain Anwar, Firoz Shaik, Shubhang Desai, Thong Q. Nguyen, Muhammad Taqi Raza, Vishal Chowdhary, Graham Neubig

"arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely ado…"
