PPT-Eval: New Benchmark for AI Agents on PowerPoint Tasks
Summary
PPT-Eval is a new benchmark featuring 120 PowerPoint tasks across 12 files, designed to evaluate computer-use agents in content creation and editing. It introduces a robust rubric-based evaluation framework that awards partial credit and provides natural language feedback.
Why it matters
This benchmark provides a standardized and nuanced way for professionals to assess and improve the capabilities of AI agents in common, complex office automation tasks, driving progress in real-world productivity tools.
How to implement this in your domain
1. Utilize PPT-Eval to benchmark the performance of your organization's AI automation tools on complex, multimodal tasks.
2. Integrate rubric-based evaluation principles into your internal testing frameworks for AI agents handling office productivity tasks.
3. Identify specific areas where current AI agents struggle in presentation creation and editing to guide future development efforts.
4. Review the benchmark's findings to understand the current limitations of frontier AI models in real-world computer-use scenarios.
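
The rubric-based principle mentioned in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PPT-Eval's actual implementation: the `Criterion` class, the `score_rubric` function, the weights, and the example criteria are all hypothetical. It shows the two properties the summary describes: partial credit (a weighted fraction of criteria met) and natural language feedback (a note for each unmet criterion).

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not PPT-Eval's schema)."""
    description: str
    weight: float
    passed: bool
    feedback: str = ""


def score_rubric(criteria):
    """Return (score in [0, 1], feedback notes for each failed criterion)."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric needs a positive total weight")
    # Partial credit: weight earned divided by weight available.
    earned = sum(c.weight for c in criteria if c.passed)
    notes = [f"{c.description}: {c.feedback}" for c in criteria if not c.passed]
    return earned / total, notes


# Illustrative rubric for a single editing task (made-up criteria).
rubric = [
    Criterion("Title slide added", 1.0, True),
    Criterion("Chart inserted on slide 3", 2.0, True),
    Criterion("Font is consistent across slides", 1.0, False,
              "slide 4 uses a different font"),
]

score, notes = score_rubric(rubric)
print(f"score={score:.2f}")
for note in notes:
    print("-", note)
```

In a real harness, the `passed` flags would come from an automated check or a judge model rather than being hard-coded. Weighting lets a task's core requirements count for more than cosmetic ones.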
Key takeaways
- PPT-Eval is a new benchmark for evaluating AI agents on complex PowerPoint tasks.
- It features 120 tasks across 12 files, covering content creation and editing.
- A robust rubric-based evaluation framework provides partial credit and natural language feedback.
- Current frontier agents still struggle, indicating significant room for improvement in office automation AI.
Original post by Apurva Gandhi, Vishwas Suryanarayanan, Raja Hasnain Anwar, Firoz Shaik, Shubhang Desai, Thong Q. Nguyen, Muhammad Taqi Raza, Vishal Chowdhary, Graham Neubig
"arXiv:2606.31154v1 Announce Type: new Abstract: Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely ado…"