PHF Improves LLM Reasoning by Distilling Teacher's Internal States

Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun · June 30, 2026

Summary

Researchers propose Privileged Hidden Flow (PHF), a method for on-policy self-distillation (OPSD) that improves LLM reasoning. Beyond the output distribution, PHF distills how a privileged teacher's hidden states evolve along the generated sequence, and it is reported to improve aggregate performance over existing OPSD baselines.

On-policy self-distillation (OPSD) trains a reasoning model on rollouts sampled from its own policy, guided by a "privileged teacher" that also sees verified reference solutions. Existing OPSD objectives supervise only the output distribution, so the teacher's internal computation is not used directly. Privileged Hidden Flow (PHF) additionally distills how the privileged teacher's hidden states evolve along the same generated sequence. Instead of forcing exact hidden-state matches, PHF aligns the token-to-token transition directions and the overall trajectory geometry of the hidden states. The method, which includes an all-layer recipe and adjacent-layer relations, is reported to consistently improve aggregate performance across several Qwen models.
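To make the idea of aligning directions and geometry more concrete, here is a minimal NumPy sketch. It is not the paper's objective: the two loss functions, their names, and the use of cosine similarity are assumptions chosen for illustration, and the arrays stand in for hidden states of shape (tokens, hidden size) taken from one layer along the same generated sequence.

```python
import numpy as np

def _unit(x, eps=1e-8):
    """Scale each row of x to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def transition_direction_loss(student_h, teacher_h):
    """Hypothetical loss: 1 - mean cosine similarity between the
    token-to-token steps h[t+1] - h[t] of student and teacher."""
    ds = _unit(np.diff(student_h, axis=0))  # (T-1, d) student step directions
    dt = _unit(np.diff(teacher_h, axis=0))  # (T-1, d) teacher step directions
    return float(1.0 - np.mean(np.sum(ds * dt, axis=-1)))

def trajectory_geometry_loss(student_h, teacher_h):
    """Hypothetical loss: mean squared gap between the (T, T) matrices of
    pairwise cosine similarities among positions of each trajectory."""
    us, ut = _unit(student_h), _unit(teacher_h)
    return float(np.mean((us @ us.T - ut @ ut.T) ** 2))

# Toy data (assumption): 6 tokens, hidden size 16.
rng = np.random.default_rng(0)
teacher_h = rng.normal(size=(6, 16))
student_h = teacher_h + 0.1 * rng.normal(size=(6, 16))

print(transition_direction_loss(student_h, teacher_h))
print(trajectory_geometry_loss(student_h, teacher_h))
```

In this sketch, a student whose hidden states are a shifted and rescaled copy of the teacher's (for example `2.0 * teacher_h + 3.0`) still gets a transition-direction loss of essentially zero, which illustrates how a direction-based target differs from requiring exact hidden-state matches.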

Why it matters

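This research gives reasoning models a richer training signal: the student learns from how the privileged teacher's hidden states evolve, not only from its output distribution. In OPSD the teacher is privileged because it sees verified reference solutions, so the approach is a form of self-distillation and not primarily a way to compress a larger model into a smaller one.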

How to implement this in your domain

  1. Investigate integrating PHF into existing self-distillation or knowledge distillation pipelines for LLMs.
  2. Experiment with PHF to improve the reasoning capabilities of LLMs on specific tasks where verified reference solutions are available.
  3. Evaluate the trade-off between the added computational cost of the hidden-state terms and the performance gains they bring.
  4. Apply PHF only where the teacher's hidden states are accessible, such as open-weight models; API-only proprietary models do not expose the internal states the method distills.
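As a starting point for steps 1 and 3, the sketch below shows one way a hidden-state term could be added to an output-distribution loss. Everything here is an assumption for illustration: the KL direction, the function names, the `hidden_weight` parameter and its default, and the placeholder `hidden_term` value (which would come from whatever hidden-state alignment loss you implement). None of it is taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def output_kl(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student); the direction is an assumption."""
    lt = log_softmax(teacher_logits)
    ls = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(lt) * (lt - ls), axis=-1)))

def combined_loss(student_logits, teacher_logits, hidden_term, hidden_weight=0.1):
    """Output-distribution loss plus a weighted hidden-state term.
    hidden_weight=0.0 recovers the output-only baseline for comparison."""
    return output_kl(student_logits, teacher_logits) + hidden_weight * hidden_term

# Toy data (assumption): 5 token positions, vocabulary of 11.
rng = np.random.default_rng(1)
teacher_logits = rng.normal(size=(5, 11))
student_logits = rng.normal(size=(5, 11))

baseline = combined_loss(student_logits, teacher_logits, hidden_term=0.5, hidden_weight=0.0)
with_hidden = combined_loss(student_logits, teacher_logits, hidden_term=0.5, hidden_weight=0.2)
print(baseline, with_hidden)
```

Keeping the hidden-state term behind a single weight makes the cost-versus-gain comparison in step 3 straightforward: train once with the weight at zero and once with it enabled, then compare accuracy and training time.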

Who benefits

AI Engineering, Software Development, Research & Development, Cloud Computing, Data Science

Key takeaways

  • PHF enhances on-policy self-distillation by leveraging a teacher's internal hidden states.
  • It aligns hidden state transition directions and trajectory geometry, not just output distributions.
  • PHF is reported to consistently improve aggregate reasoning performance across several Qwen models.
  • This method offers a more effective way to transfer complex reasoning from teacher to student models.

Original post by Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

"arXiv:2606.29340v1 Announce Type: new Abstract: On-policy self-distillation (OPSD) trains a reasoning model on rollouts sampled from its own policy by matching a privileged teacher that also sees verified reference solutions. Existing OPSD objectives supervise only the output dis…"
