On-Policy Self-Distillation Limits in Continual Learning Revealed

Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu · July 3, 2026

Summary

This research revisits the effectiveness of on-policy self-distillation for continual post-training of foundation models, finding that denser self-distillation can accelerate in-domain specialization but struggles with out-of-distribution scenarios and can even lead to catastrophic forgetting.

Continual post-training is a critical technique for enabling foundation models to acquire new knowledge while retaining existing capabilities. On-policy learning, particularly on-policy self-distillation, has been viewed optimistically as a method to mitigate forgetting. However, this study, using self-distillation policy optimization (SDPO), challenges that view. The findings indicate that while SDPO can accelerate in-domain specialization when teacher signals are stable, it performs poorly in out-of-distribution scenarios and can even cause models to collapse, leading to significant forgetting. In contrast, other on-policy reinforcement learning methods like GRPO adapt more conservatively and better preserve prior knowledge. Analysis suggests that denser self-distillation causes greater drift in both parameter and response spaces, potentially amplifying high-frequency artifacts through a self-reinforcing teacher-student loop. The research concludes that on-policy data alone is insufficient for robust continual learning, and dense self-distillation, while useful for specialization with stable targets, should not be considered a default stabilization strategy. The code is available.
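The contrast between sparse and dense training signals can be made concrete with a small sketch. The code below is illustrative only and is not the paper's SDPO or GRPO implementation: `group_relative_advantages` follows the commonly described GRPO idea of standardizing rewards within a group of sampled responses (using the population standard deviation as a simplification), and `per_token_kl` uses a generic per-token KL between a teacher and a student distribution as a stand-in for a dense distillation signal. Both function names and the toy numbers are assumptions.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Sparse signal: one scalar per sampled response.

    Each reward is standardized against the mean and (population) standard
    deviation of its own sampling group, in the style usually described for GRPO.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_kl(
    teacher: Sequence[Sequence[float]],
    student: Sequence[Sequence[float]],
    eps: float = 1e-12,
) -> List[float]:
    """Dense signal: one KL(teacher || student) value per token position.

    `teacher` and `student` hold one probability distribution per token.
    This is a generic distillation term, assumed here for illustration.
    """
    return [
        sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(t_dist, s_dist) if t > 0.0)
        for t_dist, s_dist in zip(teacher, student)
    ]


if __name__ == "__main__":
    # Four sampled responses with binary rewards: one scalar of feedback each.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

    # One three-token response over a toy two-word vocabulary: feedback at every token.
    teacher = [[0.7, 0.3], [0.5, 0.5], [0.9, 0.1]]
    student = [[0.7, 0.3], [0.9, 0.1], [0.5, 0.5]]
    print(per_token_kl(teacher, student))
```

In this toy setup the reward-based method supplies one number per response, while the distillation term supplies one number per token. If the teacher is itself a copy of the drifting student, every one of those per-token targets moves with it, which is one way to picture the self-reinforcing loop the summary describes.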

Why it matters

For AI engineers and researchers building continually learning systems, this work shows where a popular technique breaks down: dense on-policy self-distillation can destabilize a model instead of protecting it. That argues for more conservative update strategies when the goal is to prevent catastrophic forgetting and keep a model stable as its training distribution shifts.

How to implement this in your domain

1. Re-evaluate current continual learning strategies, particularly those relying heavily on dense self-distillation, for potential stability issues.
2. Explore alternative or complementary on-policy reinforcement learning methods like GRPO for continual post-training.
3. Implement monitoring mechanisms to detect parameter and response space drift during continual learning to prevent model collapse.
4. Investigate hybrid approaches that combine sparse self-distillation with other regularization techniques to balance specialization and knowledge retention.
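The drift monitoring mentioned in the steps above could be sketched roughly as follows. This is a minimal example under stated assumptions, not the paper's measurement protocol: parameter drift is taken to be the relative L2 distance from a reference checkpoint, response drift is taken to be the mean KL divergence between reference and current next-token distributions on a fixed probe set, and the alert thresholds are hypothetical placeholders that would need tuning. All function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def relative_param_drift(
    reference: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
) -> float:
    """Relative L2 distance ||current - reference|| / ||reference|| over all named tensors."""
    diff_sq = 0.0
    ref_sq = 0.0
    for name, ref_values in reference.items():
        for r, c in zip(ref_values, current[name]):
            diff_sq += (c - r) ** 2
            ref_sq += r ** 2
    # Small constant guards against division by zero for an all-zero reference.
    return math.sqrt(diff_sq) / (math.sqrt(ref_sq) + 1e-12)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_response_drift(
    reference_probs: List[Sequence[float]],
    current_probs: List[Sequence[float]],
) -> float:
    """Average KL(reference || current) over a fixed set of probe positions."""
    kls = [kl_divergence(p, q) for p, q in zip(reference_probs, current_probs)]
    return sum(kls) / len(kls)


def drift_alert(
    param_drift: float,
    response_drift: float,
    param_limit: float = 0.05,    # hypothetical threshold
    response_limit: float = 0.1,  # hypothetical threshold
) -> bool:
    """Flag a checkpoint when either drift measure exceeds its limit."""
    return param_drift > param_limit or response_drift > response_limit


if __name__ == "__main__":
    reference = {"w": [1.0, 2.0, 2.0]}
    current = {"w": [1.0, 2.0, 2.3]}
    p_drift = relative_param_drift(reference, current)
    r_drift = mean_response_drift([[0.5, 0.5]], [[0.9, 0.1]])
    print(p_drift, r_drift, drift_alert(p_drift, r_drift))
```

In practice the flattened parameter lists would come from model checkpoints and the probe distributions from a held-out set of prompts drawn from earlier tasks, so that a rising response drift on old-task prompts serves as an early warning of forgetting.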

Who benefits

AI/ML Development, Robotics, Autonomous Systems, Software Engineering

Key takeaways

Dense on-policy self-distillation can lead to catastrophic forgetting in continual learning.
Dense self-distillation performs poorly in out-of-distribution scenarios.
Other on-policy RL methods may offer better knowledge preservation.
On-policy data alone is insufficient for robust continual learning.

Original post by Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu

"arXiv:2607.01763v1 Announce Type: new Abstract: Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a…"
