DiPOD Stabilizes Diffusion Policy Optimization for Reinforcement Learning

Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab · June 15, 2026


Summary

A new framework called DiPOD (Diffusion Policy Optimization without Drifting Apart) addresses the instability in diffusion policy-gradient methods used for reinforcement learning post-training. DiPOD mitigates the "double-drift" phenomenon by interleaving self-distillation with policy-improving gradient updates, leading to more stable training and higher rewards.

This research tackles the instability of existing diffusion policy-gradient methods in reinforcement learning (RL) post-training. The authors attribute it to what they call the "double-drift phenomenon": optimizing a variational surrogate causes the ELBO (evidence lower bound) to drift away from the true log-likelihood, which misaligns the policy gradients and makes policy improvement unreliable.

To counteract this, the paper introduces DiPOD, a diffusion policy optimization framework designed to keep the bound tight throughout training by interleaving self-distillation with policy-improving gradient updates. In practice, this becomes an algorithm that augments each diffusion policy-gradient update with an on-policy ELBO regularizer. In experiments on both diffusion language model post-training and continuous-control diffusion policies, the authors report that DiPOD substantially stabilizes training and reaches higher rewards than previous methods.
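To make the "tight bound" idea concrete, the standard variational identity behind any ELBO can be written out. The notation here is an assumption for illustration and may differ from the paper's: \pi_\theta is the diffusion policy, a is an action (or generated sequence), s is the state (or prompt), z stands for the intermediate noised latents, and q is the fixed forward noising process.

```latex
\log \pi_\theta(a \mid s)
  = \underbrace{\mathbb{E}_{q(z \mid a, s)}\!\left[
      \log \frac{\pi_\theta(a, z \mid s)}{q(z \mid a, s)}
    \right]}_{\mathrm{ELBO}_\theta(a \mid s)}
  + \underbrace{\mathrm{KL}\!\left(
      q(z \mid a, s) \,\middle\|\, \pi_\theta(z \mid a, s)
    \right)}_{\text{gap} \;\ge\; 0}
```

Because the KL term is non-negative, the ELBO never exceeds the log-likelihood, and the two coincide only when the gap is zero. Read this way, a policy-gradient method that substitutes the ELBO for the log-likelihood is well aligned only while the gap stays small, which is consistent with the summary's description of DiPOD as maintaining tight-bound behavior during training.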

Why it matters

For practitioners working on reinforcement learning and generative models, DiPOD offers a more stable way to improve diffusion policies through RL post-training, which the authors report leads to higher rewards.

How to implement this in your domain

  1. Integrate the DiPOD framework into existing diffusion policy-gradient methods for RL post-training to enhance stability.
  2. Apply the on-policy ELBO regularizer to diffusion language model fine-tuning to achieve higher rewards.
  3. Experiment with DiPOD for continuous-control diffusion policies to improve agent performance and training reliability.
  4. Benchmark DiPOD against current state-of-the-art diffusion policy optimization techniques in your specific applications.
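As a rough illustration of the first two steps, the sketch below adds an ELBO term computed on the policy's own samples to an importance-weighted policy-gradient surrogate. It is not the paper's algorithm: the function names, the `reg_weight` hyperparameter, and the exact form of both terms are assumptions made for this example, and per-sample ELBO estimates are passed in as plain floats, whereas a real setup would compute them with an autodiff framework.

```python
import math
from typing import Sequence


def policy_gradient_surrogate(
    elbo_new: Sequence[float],
    elbo_old: Sequence[float],
    advantages: Sequence[float],
) -> float:
    """Importance-weighted surrogate that uses ELBOs as log-likelihood proxies.

    Assumption: the ratio exp(elbo_new - elbo_old) stands in for the true
    likelihood ratio, which is only accurate while the ELBO gap is small.
    """
    ratios = [math.exp(new - old) for new, old in zip(elbo_new, elbo_old)]
    weighted = [r * a for r, a in zip(ratios, advantages)]
    # Negated because optimizers minimize, while we want higher reward.
    return -sum(weighted) / len(weighted)


def on_policy_elbo_regularizer(elbo_new: Sequence[float]) -> float:
    """Negative mean ELBO on samples drawn from the current policy.

    Minimizing this raises the ELBO on the policy's own samples, a
    self-distillation-style term (hypothetical form, for illustration).
    """
    return -sum(elbo_new) / len(elbo_new)


def regularized_loss(
    elbo_new: Sequence[float],
    elbo_old: Sequence[float],
    advantages: Sequence[float],
    reg_weight: float = 0.1,  # hypothetical hyperparameter, not from the paper
) -> float:
    """Policy-gradient surrogate plus a weighted on-policy ELBO regularizer."""
    pg = policy_gradient_surrogate(elbo_new, elbo_old, advantages)
    reg = on_policy_elbo_regularizer(elbo_new)
    return pg + reg_weight * reg


if __name__ == "__main__":
    # Toy numbers: two on-policy samples with made-up ELBOs and advantages.
    elbo_old = [-1.0, -2.0]
    elbo_new = [-1.0, -2.0]
    advantages = [1.0, -1.0]
    print(regularized_loss(elbo_new, elbo_old, advantages))
```

Setting `reg_weight` to zero recovers the unregularized surrogate, which makes it easy to benchmark the two variants against each other on the same rollouts, as the fourth step suggests.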

Who benefits

AI/ML Engineering, Robotics, Autonomous Systems, Natural Language Processing, Gaming

Key takeaways

  • Diffusion policy-gradient methods suffer from instability due to a "double-drift" phenomenon.
  • DiPOD stabilizes training by interleaving self-distillation and policy updates.
  • An on-policy ELBO regularizer is key to DiPOD's practical implementation.
  • DiPOD leads to substantially more stable training and higher rewards in both diffusion language model post-training and continuous-control diffusion policies.

Original post by Haozhe Jiang, Haiwen Feng, Pieter Abbeel, Jiantao Jiao, Angjoo Kanazawa, Nika Haghtalab

"arXiv:2606.13795v1 Announce Type: new Abstract: RL post-training has become increasingly pivotal for improving diffusion policies, but existing diffusion policy-gradient methods are often unstable and cannot achieve reliable policy improvement. We identify the cause as the double…"


