GPTNT Benchmarks Real-Time Multimodal Agent Collaboration

Amit Parekh, Sabrina McCallum, Kareem Al-Hasan, Malvina Nikandrou, Alessandro Suglia, Ioannis Konstas · June 30, 2026

Summary

GPTNT is a new benchmark using the game "Keep Talking and Nobody Explodes" to evaluate real-time collaboration between multimodal AI agents under time pressure and information asymmetry. It reveals critical weaknesses in state-of-the-art models regarding state tracking, efficient action, ambiguity handling, and error recovery.

This paper introduces GPTNT, a novel benchmark designed to assess the real-time collaborative capabilities of multimodal AI agents. The benchmark is built around the cooperative video game "Keep Talking and Nobody Explodes," where two agents must work together to defuse procedurally generated bombs against a live countdown. One agent possesses visual and manipulative access to the bomb but lacks defusal instructions, while the other has the instructions but cannot see or interact with the bomb. Success hinges on effective and efficient communication under conditions of time pressure and information asymmetry. Unlike turn-based evaluations, GPTNT requires agents to communicate asynchronously and act in real time. The benchmark is structured to isolate collaborative performance from memorized solutions, allowing researchers to withhold either the instruction manual or the partner. Experiments with state-of-the-art closed- and open-source models reveal significant challenges: none of the tested models successfully defused a single bomb in real time, a task human players readily accomplish. The study identifies critical weaknesses in areas such as state tracking, efficient action under pressure, handling ambiguous communication, and error recovery. GPTNT is released as a public benchmark to drive advancements in collaborative AI.
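The setup described above can be illustrated with a toy simulation. The following is a minimal sketch only, assuming nothing about the actual GPTNT code or API: the names `Episode`, `expert`, `defuser`, `run_episode`, and the one-rule `MANUAL` are all hypothetical. It shows two agents that hold different information, exchange messages asynchronously through queues, and race a countdown, with flags for withholding the manual or the partner.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical one-rule "manual"; not taken from the game or from GPTNT.
MANUAL = {"red": "cut wire 1", "blue": "cut wire 2"}


@dataclass
class Episode:
    wire_color: str      # visible only to the defuser
    solution: str        # the action that defuses this toy bomb
    time_limit: float    # countdown in seconds
    transcript: list = field(default_factory=list)
    defused: bool = False


async def expert(inbox, outbox, manual):
    """Holds the manual but never sees the bomb; answers each description."""
    while True:
        description = await inbox.get()
        await outbox.put(manual.get(description, "describe it again"))


async def defuser(episode, to_expert, from_expert):
    """Sees the bomb but has no manual; describes it, then acts on the reply."""
    await to_expert.put(episode.wire_color)
    episode.transcript.append(("defuser", episode.wire_color))
    instruction = await from_expert.get()  # blocks until the partner answers
    episode.transcript.append(("expert", instruction))
    episode.defused = instruction == episode.solution


async def run_episode(episode, with_manual=True, with_partner=True):
    """Run one episode against the countdown; return True if defused in time."""
    to_expert, from_expert = asyncio.Queue(), asyncio.Queue()
    manual = MANUAL if with_manual else {}
    partner = None
    if with_partner:
        partner = asyncio.create_task(expert(to_expert, from_expert, manual))
    try:
        await asyncio.wait_for(
            defuser(episode, to_expert, from_expert),
            timeout=episode.time_limit,
        )
    except asyncio.TimeoutError:
        pass  # the countdown expired before the defuser could act
    finally:
        if partner is not None:
            partner.cancel()
    return episode.defused


if __name__ == "__main__":
    episode = Episode(wire_color="red", solution="cut wire 1", time_limit=0.5)
    print(asyncio.run(run_episode(episode)), episode.transcript)
```

In this sketch, passing `with_manual=False` leaves the expert with nothing useful to say, and `with_partner=False` leaves the defuser waiting until the timer runs out. Real agents would replace the two coroutines with model calls, where latency and ambiguous descriptions make the countdown far harder to beat.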

Why it matters

GPTNT shows that current AI models still struggle to collaborate on complex tasks in real time and under pressure, a capability that systems built for dynamic human-AI or multi-AI team environments will need.

How to implement this in your domain

  1. Utilize GPTNT as a benchmark for developing and testing AI agents intended for collaborative tasks in dynamic environments.
  2. Focus R&D efforts on improving AI capabilities in real-time state tracking, efficient decision-making under time constraints, and robust error recovery.
  3. Explore how insights from GPTNT can inform the design of human-AI interfaces for collaborative problem-solving.

Who benefits

Robotics, Gaming, Defense, Logistics, Emergency Services

Key takeaways

  • GPTNT is a new benchmark for real-time multimodal agent collaboration.
  • It exposes significant weaknesses in current state-of-the-art AI models.
  • Key challenges include state tracking, efficient action, and error recovery.
  • Human-level collaborative performance remains a substantial hurdle for AI.

Original post by Amit Parekh, Sabrina McCallum, Kareem Al-Hasan, Malvina Nikandrou, Alessandro Suglia, Ioannis Konstas

"arXiv:2606.28514v1 Announce Type: new Abstract: Multimodal models are increasingly deployed to solve tasks collaboratively with humans or other artificial agents. Existing benchmarks show that these models possess many of the required component capabilities, but the conditions th…"

