PowerOPD Stabilizes On-Policy Distillation for Large Language Models

Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen · June 17, 2026

Summary

This paper introduces PowerOPD, a method that uses a bounded Box-Cox power transformation to stabilize on-policy distillation (OPD) for large language models. PowerOPD addresses the high-variance gradients caused by unbounded log-ratio rewards in vanilla OPD, significantly improving performance and efficiency across mathematical reasoning benchmarks.

On-policy distillation (OPD) trains a smaller "student" large language model (LLM) to match a larger "teacher" model, typically by estimating a reverse-KL objective from tokens sampled by the student. This single-sample estimate avoids expensive vocabulary-wide computation, but training often suffers from severe instabilities, including sample inefficiency and unstable generation dynamics.

The authors trace these pathologies to the unbounded log-ratio reward used in vanilla OPD, which produces extremely high-variance gradients concentrated early in training. Standard post-hoc scaling methods are ineffective because they are applied after this distortion has already occurred.

PowerOPD instead uses a family of natively bounded, sign-consistent rewards derived from the Box-Cox power transformation. The family is parameterized by alpha, and the log-ratio reward is its degenerate limit as alpha approaches zero. According to the paper, PowerOPD delivers substantial gains on mathematical reasoning benchmarks over both vanilla and full-vocabulary OPD, while also reducing training time and GPU memory usage.
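To make the bounding idea concrete, here is a minimal Python sketch. The summary does not state the exact PowerOPD reward formula, so `power_reward` below is an assumption: it takes the difference of Box-Cox transformed teacher and student probabilities, which is one form that is bounded, agrees in sign with the log-ratio, and recovers the log-ratio as alpha approaches zero. All function names are hypothetical and not taken from the paper.

```python
import math


def box_cox(x: float, alpha: float) -> float:
    """Box-Cox power transform: (x**alpha - 1) / alpha, with log(x) at alpha == 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if alpha == 0.0:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


def log_ratio_reward(p_teacher: float, p_student: float) -> float:
    """Vanilla OPD per-token reward: log(p_teacher / p_student). Unbounded below."""
    return math.log(p_teacher) - math.log(p_student)


def power_reward(p_teacher: float, p_student: float, alpha: float) -> float:
    """ASSUMED bounded reward: difference of Box-Cox transformed probabilities.

    Equals (p_teacher**alpha - p_student**alpha) / alpha for alpha > 0.
    This is an illustrative guess, not the paper's confirmed definition.
    """
    return box_cox(p_teacher, alpha) - box_cox(p_student, alpha)


if __name__ == "__main__":
    # A token the student likes but the teacher considers nearly impossible.
    p_t, p_s = 1e-30, 0.9
    print("log-ratio reward:", log_ratio_reward(p_t, p_s))
    for alpha in (1e-6, 0.25, 0.5, 1.0):
        print(f"alpha={alpha}: power reward =", power_reward(p_t, p_s, alpha))
```

For probabilities in (0, 1] and alpha > 0, this assumed reward always lies strictly between -1/alpha and 1/alpha, whereas the log-ratio diverges as the teacher probability goes to zero.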

Why it matters

Stabilizing and improving the efficiency of LLM distillation is crucial for deploying powerful AI models on more constrained hardware or in scenarios requiring faster inference. This method allows for the creation of smaller, more performant LLMs, making advanced AI more accessible and cost-effective.

How to implement this in your domain

  1. Integrate PowerOPD into LLM training pipelines to stabilize on-policy distillation and improve student model performance.
  2. Experiment with different alpha parameters in the Box-Cox transformation to find the optimal balance for specific distillation tasks.
  3. Apply PowerOPD when fine-tuning smaller LLMs for specialized tasks to achieve better accuracy and faster training.
  4. Evaluate the computational savings in wall-clock time and GPU memory when using PowerOPD compared to traditional OPD methods.
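As a rough illustration of the alpha experiment in step 2, the sketch below sweeps alpha on a toy five-token vocabulary and reports the exact variance of the per-token reward under student sampling. Everything here is an assumption made for illustration: the distributions are invented, the reward form `(p_t**alpha - p_s**alpha) / alpha` is a guess at a bounded Box-Cox style reward rather than the paper's stated formula, and raw reward variance is only a proxy for gradient variance.

```python
import math


def reward(p_t: float, p_s: float, alpha: float) -> float:
    """alpha == 0 gives the log-ratio; alpha > 0 gives the ASSUMED bounded power reward."""
    if alpha == 0.0:
        return math.log(p_t) - math.log(p_s)
    return (p_t ** alpha - p_s ** alpha) / alpha


def reward_variance(teacher, student, alpha: float) -> float:
    """Exact variance of the per-token reward when tokens are drawn from the student."""
    rewards = [reward(t, s, alpha) for t, s in zip(teacher, student)]
    mean = sum(s * r for s, r in zip(student, rewards))
    second_moment = sum(s * r * r for s, r in zip(student, rewards))
    return second_moment - mean * mean


# Toy distributions over 5 tokens (invented; each sums to 1).
# The teacher assigns near-zero mass to two tokens the student still samples.
STUDENT = [0.4, 0.3, 0.15, 0.1, 0.05]
TEACHER = [0.6, 0.3, 0.0998999, 1e-4, 1e-7]


def sweep(alphas=(0.0, 0.25, 0.5, 1.0)):
    """Map each alpha to the reward variance on the toy distributions."""
    return {a: reward_variance(TEACHER, STUDENT, a) for a in alphas}


if __name__ == "__main__":
    for alpha, var in sweep().items():
        print(f"alpha={alpha:<5} reward variance={var:.4f}")
```

On this toy setup every positive alpha in the sweep yields a lower reward variance than the log-ratio case (alpha = 0), partly because the bounded rewards are smaller in magnitude. A real comparison would measure gradient variance and downstream accuracy inside the training loop.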

Who benefits

AI/ML Engineering, Software Development, EdTech, Automotive, Cloud Computing

Key takeaways

  • PowerOPD stabilizes LLM on-policy distillation by bounding rewards.
  • It uses a Box-Cox power transformation to address high-variance gradients.
  • The method significantly improves performance and efficiency in LLM training.
  • PowerOPD reduces wall-clock time and GPU memory usage.

Original post by Anhao Zhao, Junlong Tong, Yingqi Fan, Ping Nie, Wenjie Li, Xiaoyu Shen

"arXiv:2606.17199v1 Announce Type: new Abstract: Standard on-policy distillation (OPD) for large language models estimates the reverse-KL objective using student-sampled tokens, yielding an unbiased single-sample Monte Carlo estimator that avoids vocabulary-wide computation. Howev…"
