EMAgnet Improves Policy Gradient Self-Play in Large Games

Tristan Maidment, JB Lanier, Chase McDonald, Nathan Tsang, Eugene Vinitsky, Roy Fox, Albert Wang, Wesley N. Kerr · June 24, 2026

Summary

Researchers introduce EMAgnet, a regularization technique for policy gradient self-play that uses an exponential moving average of past policy parameters as an adaptive target. In most tested two-player zero-sum games, it achieves lower exploitability than PPO self-play regularized toward a uniform target.

Policy gradient methods used in self-play have shown promise in solving complex two-player zero-sum imperfect-information games, in some cases matching or exceeding specialized game-theoretic algorithms. A common regularization strategy pulls the policy toward a uniform distribution, but a uniform target treats all actions equally, regardless of their strategic viability.

EMAgnet addresses this limitation with an adaptive regularization target. Instead of a static uniform distribution, it regularizes the policy toward an exponential moving average (EMA) of the last-iterate policy's parameters, so the target evolves with the agent's improving strategy.

The authors evaluated EMAgnet on standard and modified two-player zero-sum benchmarks, including games with significant exploration challenges and many strictly dominated strategies. Compared to PPO self-play with uniform-magnet regularization, it achieved lower exploitability in most tested environments, with the clearest gains in games where strictly dominated strategies are prevalent.
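The adaptive target described above can be sketched for a single decision point with a tabular softmax policy. This is a minimal illustration, not the authors' implementation: the function names, the `decay` and `beta` parameters, and the choice of a KL penalty from the policy to the magnet are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(magnet_params, policy_params, decay):
    """Move the magnet parameters toward the current policy parameters.

    `decay` is a hypothetical hyperparameter in [0, 1); values close to 1
    make the magnet change slowly.
    """
    return [decay * m + (1.0 - decay) * p
            for m, p in zip(magnet_params, policy_params)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_loss(policy_loss, policy_params, magnet_params, beta):
    """Add a penalty that pulls the policy toward the magnet policy.

    `beta` (assumed) controls the regularization strength. The KL direction
    used here is an assumption, not taken from the paper.
    """
    pi = softmax(policy_params)
    magnet = softmax(magnet_params)
    return policy_loss + beta * kl_divergence(pi, magnet)


# One illustrative step: the magnet starts uniform (all-zero logits) and
# drifts toward the current policy instead of staying fixed.
policy_params = [2.0, 0.0, -1.0]
magnet_params = [0.0, 0.0, 0.0]
magnet_params = ema_update(magnet_params, policy_params, decay=0.9)
loss = regularized_loss(1.0, policy_params, magnet_params, beta=0.5)
```

With a uniform magnet, `magnet_params` would stay at all-zero logits for the whole run. In this sketch it follows the policy with a lag, so the penalty shrinks for actions the policy keeps preferring.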

Why it matters

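In a two-player zero-sum game, lower exploitability means a best-responding opponent can gain less against the trained policy. An adaptive regularization target offers policy gradient self-play a way to reach such policies without pulling toward strategically poor actions, which matters most in large games where many actions are strictly dominated.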

How to implement this in your domain

  1. Explore integrating EMAgnet's adaptive regularization into your existing policy gradient self-play algorithms.
  2. Apply EMAgnet to train AI agents for complex strategic games or simulations.
  3. Benchmark EMAgnet's performance against uniform regularization in environments with exploration challenges.
  4. Consider using EMAgnet for developing more robust and less exploitable AI opponents or teammates.
  5. Investigate its applicability in multi-agent reinforcement learning scenarios beyond zero-sum games.
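For the benchmarking step, exploitability is the metric to compare. The sketch below computes it for a small two-player zero-sum matrix game, assuming `payoff[i][j]` is the row player's payoff and the column player receives its negative. It sums both players' best-response gains; some references report half of this value. The function name and the rock-paper-scissors example are illustrative choices, not taken from the paper.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """Total best-response gain against a strategy pair in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff when row plays i and column plays j.
    The players' current values sum to zero, so the total gain equals the sum
    of the two best-response values.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])

    # Best value the row player can get against the fixed column strategy.
    row_best = max(
        sum(payoff[i][j] * col_strategy[j] for j in range(n_cols))
        for i in range(n_rows)
    )

    # Best value the column player can get against the fixed row strategy
    # (the column player's payoff is the negative of the row player's).
    col_best = max(
        -sum(row_strategy[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    )

    return row_best + col_best


# Rock-paper-scissors: uniform play is the equilibrium, always playing
# rock is exploitable.
rps = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

eq_gap = exploitability(rps, uniform, uniform)
rock_gap = exploitability(rps, always_rock, always_rock)
```

Here `eq_gap` is zero up to floating-point error and `rock_gap` is 2, since each player can switch to the action that beats rock. In large imperfect-information games the best responses must be computed over the game tree, which this sketch does not attempt.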

Who benefits

Gaming, Robotics, Autonomous Systems, Defense, Financial Trading

Key takeaways

  • EMAgnet introduces adaptive regularization for policy gradient self-play.
  • It uses an exponential moving average of policy parameters as a dynamic target.
  • EMAgnet achieves lower exploitability than uniform-magnet PPO self-play in most tested two-player zero-sum games.
  • It performs particularly well in games with many strictly dominated strategies.
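To see why strictly dominated strategies are a problem for a uniform target, the following sketch finds actions that are strictly dominated by another pure action in a matrix game and reports how much probability a uniform distribution places on them. The payoff matrix and function names are hypothetical, and the check ignores domination by mixed strategies.

```python
def strictly_dominated_rows(payoff):
    """Return indices of rows strictly dominated by some other pure row.

    payoff[i][j] is the row player's payoff. Row i is strictly dominated
    by row k if row k pays strictly more against every column.
    """
    dominated = []
    for i, row_i in enumerate(payoff):
        for k, row_k in enumerate(payoff):
            if k != i and all(b > a for a, b in zip(row_i, row_k)):
                dominated.append(i)
                break  # one dominating row is enough
    return dominated


def uniform_mass_on(actions, n_actions):
    """Probability a uniform distribution over n_actions gives to `actions`."""
    return len(actions) / n_actions


# Hypothetical game: row 0 beats rows 1 and 2 against every column.
payoff = [[3, 2],
          [1, 0],
          [0, -1]]

dominated = strictly_dominated_rows(payoff)
wasted = uniform_mass_on(dominated, len(payoff))
```

In this example two of the three actions are dominated, so a uniform target keeps pulling two thirds of the probability mass toward actions a rational player would never choose. The share grows as more dominated actions are added.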

Original post by Tristan Maidment, JB Lanier, Chase McDonald, Nathan Tsang, Eugene Vinitsky, Roy Fox, Albert Wang, Wesley N. Kerr

"arXiv:2606.23995v1 Announce Type: new Abstract: Recent work has established that regularized policy gradient methods such as PPO, when used in self-play, can match or exceed specialized game-theoretic algorithms for solving two-player zero-sum imperfect-information games. The uni…"

