New Trust-Region Diffusion Policies Enhance Massively Parallel On-Policy RL

Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann · June 16, 2026

Summary

Researchers introduce Trust-region Diffusion Policies (TruDi), a novel method enabling diffusion models for on-policy reinforcement learning with massively parallel simulations. This approach integrates a trust-region optimization rule to stabilize training with complex policies, outperforming baselines on challenging control tasks.

Trust-region Diffusion Policies (TruDi) is a method for on-policy reinforcement learning with massively parallel simulations. Existing approaches in this setting mostly rely on simple Gaussian policy parameterizations, while more expressive diffusion models have typically been limited to offline or off-policy training. On-policy training is hard for diffusion policies because the data distribution shifts with every policy update. TruDi addresses this with a trust-region optimization rule that enforces a KL-divergence constraint across the entire diffusion trajectory, which keeps learning stable even with complex policy structures. In empirical evaluations across 73 tasks in four massively parallel RL benchmarks, TruDi consistently matches or surpasses strong existing baselines, with significant improvements on the more complex humanoid control tasks.
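To make "a KL-divergence constraint across the entire diffusion trajectory" more concrete, here is a minimal sketch, not the paper's implementation. It assumes that both the old and the new policy are denoising chains with Gaussian transitions that share the same per-step standard deviation and the same initial noise distribution. Under those assumptions the KL divergence between the two chains is the sum over denoising steps of the expected per-step Gaussian KL, with latents sampled from the old chain. The names `step_kl`, `trajectory_kl`, `mean_old`, and `mean_new` are hypothetical.

```python
import numpy as np


def step_kl(mu_old, mu_new, sigma):
    """KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) for each batch row."""
    diff = mu_old - mu_new
    return np.sum(diff * diff, axis=-1) / (2.0 * sigma ** 2)


def trajectory_kl(mean_old, mean_new, sigmas, x_start, rng):
    """Monte Carlo estimate of the KL between two denoising chains.

    mean_old, mean_new: callables (x, k) -> mean of the next latent (assumed interface)
    sigmas:  per-step standard deviations shared by both chains (assumption)
    x_start: array (batch, dim) of initial noise samples shared by both chains
    """
    x = x_start
    total = np.zeros(x.shape[0])
    for k, sigma in enumerate(sigmas):
        mu_o = mean_old(x, k)
        mu_n = mean_new(x, k)
        total += step_kl(mu_o, mu_n, sigma)
        # Advance along the OLD chain: the expectation is taken under the old policy.
        x = mu_o + sigma * rng.standard_normal(x.shape)
    return total.mean()


# Toy usage: the "new" chain shifts every denoising mean by 0.1 per dimension.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 2))
kl = trajectory_kl(lambda x, k: 0.9 * x, lambda x, k: 0.9 * x + 0.1, [1.0, 0.5], x0, rng)
within_trust_region = kl <= 0.1  # 0.1 is an arbitrary illustrative bound
```

In this toy case the shift is constant, so the estimate does not depend on the sampled latents: each step contributes `dim * 0.1**2 / (2 * sigma**2)`, which gives 0.01 + 0.04 = 0.05 in total.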

Why it matters

Professionals working with complex simulation environments or robotics can leverage this advancement to develop more robust and performant AI policies, especially in scenarios requiring high-fidelity control and rapid learning.

How to implement this in your domain

  1. Explore integrating TruDi's trust-region optimization into existing on-policy RL frameworks for improved stability.
  2. Apply diffusion policies in massively parallel simulation environments for complex control problems like robotics or autonomous systems.
  3. Benchmark current RL solutions against TruDi on challenging tasks to identify potential performance gains.
  4. Investigate the use of KL-divergence constraints across diffusion trajectories to enhance policy training stability.
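For steps 1 and 4, one generic way to experiment with a KL budget is to shrink a proposed update until it fits the bound. The sketch below illustrates that idea only and is not TruDi's update rule. It assumes Gaussian denoising steps with fixed, shared standard deviations, and it assumes both sets of denoising means were evaluated on the same latents sampled from the old policy. In that case, scaling the mean differences by a factor `alpha` scales the trajectory KL by `alpha**2`, so the required factor has a closed form. The function name `project_means` and the array layout are hypothetical.

```python
import numpy as np


def project_means(mu_old, mu_new, sigmas, eps):
    """Scale a proposed update so the trajectory KL stays within eps.

    mu_old, mu_new: arrays (steps, batch, dim) of denoising means, both
                    evaluated on latents sampled from the old policy (assumption)
    sigmas:         array (steps,) of shared per-step standard deviations
    eps:            KL budget for the whole denoising trajectory
    Returns (projected means, alpha, KL of the unprojected update).
    """
    sigmas = np.asarray(sigmas, dtype=float)
    diff = mu_new - mu_old
    # Per-step Gaussian KL with shared variance, shape (steps, batch).
    per_step = np.sum(diff * diff, axis=-1) / (2.0 * sigmas[:, None] ** 2)
    kl = per_step.sum(axis=0).mean()  # sum over steps, average over batch
    alpha = 1.0 if kl <= eps else float(np.sqrt(eps / kl))
    return mu_old + alpha * diff, alpha, kl


# Toy usage: 2 denoising steps, batch of 3, 2 action dimensions.
mu_old = np.zeros((2, 3, 2))
mu_new = np.full((2, 3, 2), 0.1)
projected, alpha, kl_before = project_means(mu_old, mu_new, [1.0, 0.5], eps=0.0125)
```

A real on-policy setup would apply such a constraint to network parameters rather than to precomputed means, so treat this as a way to build intuition for the constraint rather than a drop-in component.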

Who benefits

Robotics, Autonomous Vehicles, Gaming, Manufacturing, Aerospace

Key takeaways

  • TruDi enables the effective use of expressive diffusion policies in massively parallel on-policy reinforcement learning.
  • The method stabilizes training by applying a trust-region optimization rule with a KL-divergence constraint.
  • TruDi demonstrates superior or comparable performance across a wide range of complex control tasks.
  • This research sets a new standard for developing robust policies in high-fidelity simulation environments.

Original post by Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

"arXiv:2606.15260v1 Announce Type: new Abstract: Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusi…"

