On-Policy Self-Distillation Limits in Continual Learning Revealed

Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu · July 3, 2026

Summary

This research revisits the effectiveness of on-policy self-distillation for continual post-training of foundation models, finding that denser self-distillation can accelerate in-domain specialization but struggles with out-of-distribution scenarios and can even lead to catastrophic forgetting.

Continual post-training lets foundation models acquire new knowledge while retaining existing capabilities. On-policy learning, and on-policy self-distillation in particular, has been viewed as a promising way to mitigate forgetting. This study, using self-distillation policy optimization (SDPO), challenges that view. SDPO can accelerate in-domain specialization when teacher signals are stable, but it performs poorly in out-of-distribution scenarios and can even cause models to collapse, leading to significant forgetting. In contrast, other on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior knowledge. The analysis suggests that denser self-distillation causes greater drift in both parameter and response spaces, potentially amplifying high-frequency artifacts through a self-reinforcing teacher-student loop. The authors conclude that on-policy data alone is insufficient for robust continual learning, and that dense self-distillation, while useful for specialization with stable targets, should not be treated as a default stabilization strategy. The code is available.
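To make the "denser" versus "more conservative" contrast concrete, here is a minimal sketch of the two kinds of training signal. The GRPO part uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation), which yields one scalar per sampled response. The dense part is only an illustrative stand-in: it assumes the self-distillation signal can be approximated as a per-token KL divergence between a teacher and a student distribution, which is not taken from the paper and is not SDPO's exact objective. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Sparse signal: one group-normalized advantage per sampled response."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) at a single token position."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def dense_distillation_signal(teacher_seq, student_seq):
    """Dense signal: one KL term per token position (assumed stand-in, not SDPO itself)."""
    return [token_kl(t, s) for t, s in zip(teacher_seq, student_seq)]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored 0/1 by a verifier.
    print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))

    # Three token positions over a toy two-token vocabulary.
    teacher = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]
    student = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
    print(dense_distillation_signal(teacher, student))
```

The point of the comparison is only the shape of the signal: the first function returns one number per response, while the second returns one number per token, so a shifting teacher can push the student at every position.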

Why it matters

For AI engineers and researchers developing continually learning systems, this work provides crucial insights into the limitations of a popular technique, guiding them towards more robust strategies for preventing catastrophic forgetting and ensuring model stability in dynamic environments.

How to implement this in your domain

  1. Re-evaluate current continual learning strategies, particularly those relying heavily on dense self-distillation, for potential stability issues.
  2. Explore alternative or complementary on-policy reinforcement learning methods like GRPO for continual post-training.
  3. Implement monitoring mechanisms to detect parameter and response space drift during continual learning to prevent model collapse.
  4. Investigate hybrid approaches that combine sparse self-distillation with other regularization techniques to balance specialization and knowledge retention.
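The drift monitoring in step 3 could be sketched roughly as follows. This is an assumption-laden illustration, not the paper's measurement protocol: parameter drift is taken as the relative L2 distance from a reference checkpoint, response drift as the mean KL divergence between reference and current next-token distributions on a fixed probe set, and the thresholds `max_param_drift` and `max_response_kl` are hypothetical values that would need tuning per model.

```python
import math


def relative_param_drift(ref_params, cur_params):
    """||cur - ref||_2 / ||ref||_2 over flattened parameter lists."""
    diff_sq = sum((c - r) ** 2 for r, c in zip(ref_params, cur_params))
    ref_sq = sum(r ** 2 for r in ref_params)
    return math.sqrt(diff_sq) / (math.sqrt(ref_sq) + 1e-12)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def mean_response_kl(ref_dists, cur_dists):
    """Average KL(ref || cur) over a fixed set of probe positions."""
    kls = [kl_divergence(p, q) for p, q in zip(ref_dists, cur_dists)]
    return sum(kls) / len(kls)


def check_drift(param_drift, response_kl,
                max_param_drift=0.05, max_response_kl=0.1):
    """Flag drift beyond the (hypothetical) thresholds."""
    return {
        "param_alert": param_drift > max_param_drift,
        "response_alert": response_kl > max_response_kl,
    }


if __name__ == "__main__":
    ref_w, cur_w = [3.0, 4.0], [3.3, 4.4]          # toy flattened weights
    ref_p = [[0.7, 0.3], [0.5, 0.5]]               # reference model on probes
    cur_p = [[0.2, 0.8], [0.5, 0.5]]               # current model on probes
    status = check_drift(relative_param_drift(ref_w, cur_w),
                         mean_response_kl(ref_p, cur_p))
    print(status)
```

In practice the probe set would be held-out prompts from earlier tasks, evaluated at each checkpoint, so that an alert can pause training or trigger a rollback before forgetting becomes severe.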

Who benefits

AI/ML Development, Robotics, Autonomous Systems, Software Engineering

Key takeaways

  • Dense on-policy self-distillation can lead to catastrophic forgetting in continual learning.
  • Dense self-distillation also performs poorly in out-of-distribution scenarios.
  • Other on-policy RL methods such as GRPO adapt more conservatively and better preserve prior knowledge.
  • On-policy data alone is insufficient for robust continual learning.

Original post by Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu

"arXiv:2607.01763v1 Announce Type: new Abstract: Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a…"

