Blockwise Gating Improves On-Policy Distillation Robustness

Liwen Zheng, Haiyun Jiang · June 24, 2026

Summary

Researchers introduce blockwise policy-drift gating, a lightweight method for on-policy distillation (OPD) that reweights position losses based on log-probability shifts between the behavior student policy that sampled the rollouts and the current student policy. In the reported evaluation, fixed 64-token block gating raised mean pass@8 for sampled-token OPD from 0.4978 to 0.5160 on long-horizon math reasoning tasks.

On-policy distillation (OPD) is a technique where a student policy learns from a teacher's signals generated on trajectories sampled by the student itself. However, for complex, long-horizon reasoning tasks, sampled-token OPD can be fragile, leading to inconsistent performance. This paper proposes "blockwise policy-drift gating" to enhance the robustness of OPD, particularly when reusing rollouts. The method involves calculating log-probability shifts between the behavior student policy (which sampled the data) and the current student policy. These shifts are then aggregated over fixed blocks of tokens and used as detached, mean-normalized gates to reweight the OPD position losses. Crucially, this technique does not alter the teacher's targets or the rollout policy. In evaluations on a Qwen3 math reasoning benchmark, fixed 64-token block gating improved the mean pass@8 score for sampled-token OPD from 0.4978 to 0.5160 across several challenging math datasets. The results highlight that controlling local old-current policy drift is a practical signal for improving solve-rate robustness in OPD.
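The gating step described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the summary does not give the gate formula, so the choice of `exp(-mean |shift|)` per block (which down-weights high-drift blocks) is an assumption, and the names `blockwise_drift_gates` and `gated_opd_loss` are hypothetical. In an autograd framework the gates would additionally be detached so that no gradient flows through them; here they are plain floats, so they already act as constants.

```python
import math


def blockwise_drift_gates(old_logps, new_logps, block_size=64):
    """Return one gate per token from blockwise old-vs-current log-prob drift.

    old_logps: log-probs of the sampled tokens under the behavior student policy.
    new_logps: log-probs of the same tokens under the current student policy.
    ASSUMPTION: gate = exp(-mean |shift|) per block; the source does not
    specify the functional form, only that gates are mean-normalized.
    """
    if len(old_logps) != len(new_logps):
        raise ValueError("old_logps and new_logps must have the same length")
    n = len(old_logps)
    if n == 0:
        return []

    # Per-token absolute log-probability shift between the two student policies.
    shifts = [abs(new - old) for old, new in zip(old_logps, new_logps)]

    gates = []
    for start in range(0, n, block_size):
        block = shifts[start:start + block_size]
        drift = sum(block) / len(block)            # aggregate over the fixed block
        gates.extend([math.exp(-drift)] * len(block))

    # Mean-normalize so the gates average to 1 over the sequence.
    mean_gate = sum(gates) / n
    return [g / mean_gate for g in gates]


def gated_opd_loss(position_losses, gates):
    """Average the per-position OPD losses after reweighting by the gates."""
    if len(position_losses) != len(gates):
        raise ValueError("position_losses and gates must have the same length")
    return sum(g * l for g, l in zip(gates, position_losses)) / len(gates)
```

Because the gates only multiply the per-position losses, the teacher's targets and the rollout policy are left untouched, which matches the description above. Mean normalization keeps the overall loss scale comparable to the ungated case.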

Why it matters

This research offers a practical and lightweight method to improve the stability and performance of on-policy distillation, making it more effective for training student models on complex, long-horizon reasoning tasks.

How to implement this in your domain

  1. Integrate blockwise policy-drift gating into your on-policy distillation pipelines.
  2. Experiment with different block sizes to optimize performance for your specific tasks.
  3. Apply this technique to improve the robustness of student models on long-horizon reasoning challenges.
  4. Benchmark the solve-rate improvements achieved by using blockwise gating in your models.
  5. Consider this method for training smaller, more efficient student models from larger teacher models.
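For the benchmarking step in the list above, a simple solve-rate metric is needed. The sketch below assumes pass@k means the fraction of problems for which at least one of k sampled attempts is correct; the summary does not say how its pass@8 figures were computed, so this definition and the function name `pass_at_k` are assumptions for illustration.

```python
def pass_at_k(results, k=8):
    """Fraction of problems solved by at least one of the first k attempts.

    results: list with one entry per problem, each a list of booleans
             (True if that sampled attempt was judged correct).
    ASSUMPTION: exactly the first k attempts per problem are scored.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    solved = 0
    for attempts in results:
        if len(attempts) < k:
            raise ValueError("each problem needs at least k attempts")
        if any(attempts[:k]):
            solved += 1
    return solved / len(results)
```

Running the same function on a gated and an ungated student, with identical problems and sampling settings, gives a like-for-like comparison of solve rates.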

Who benefits

AI/ML Development, Education Technology, Software Engineering, Robotics, Generative AI

Key takeaways

  • Blockwise policy-drift gating improves on-policy distillation (OPD) robustness.
  • It reweights position losses based on log-probability shifts between the behavior student policy and the current student policy, aggregated over fixed blocks of tokens.
  • The method is lightweight and does not alter teacher targets or rollout policies.
  • In the reported evaluation, fixed 64-token block gating raised mean pass@8 for sampled-token OPD from 0.4978 to 0.5160 on long-horizon math reasoning tasks.

Original post by Liwen Zheng, Haiyun Jiang

"arXiv:2606.24084v1 Announce Type: new Abstract: On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that loca…"


