On-Policy Distillation Outperforms Offline Learning with Noisy Experts

Ved Sriraman, Peihan Liu, Daniel Hsu, Adam Block · July 1, 2026

Summary

This research explains why online methods like on-policy distillation (OPD) often surpass offline imitation learning (IL) when experts are noisy, especially in language model training. It proves that offline learning from noisy trajectories is fundamentally harder, requiring sample complexity that grows exponentially with the task horizon, while OPD achieves polynomial dependence on the horizon.

Imitation Learning (IL) is a fundamental approach for sequential decision-making, notably influencing language model training. A puzzling observation is that online methods, such as on-policy distillation (OPD), frequently outperform offline methods like supervised fine-tuning, despite theoretical claims of offline IL's optimality. This paper offers a theoretical explanation for this discrepancy by introducing a "noisy expert" model. In this model, the learner only has access to an imperfect version of the expert's policy, yet aims to match the performance of a clean, ideal expert. The research demonstrates a stark difference between offline and online IL in this scenario. Offline learning from noisy trajectories is shown to be inherently difficult, requiring sample complexity that grows exponentially with the task horizon to compete with a clean expert. Conversely, online interaction with the noisy expert via a novel variant of OPD achieves polynomial dependence on the horizon. The paper further shows that under specific, natural conditions on expert noise, a horizon-free sample complexity can be obtained, although with some statistical efficiency trade-offs. This analysis provides a theoretical foundation for why OPD is often more effective than supervised fine-tuning when training language models from imperfect teachers.
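The difference between the two settings comes down to where the training states come from. The following is a minimal Python sketch of that difference on a toy tabular problem, not the paper's algorithm or its novel OPD variant. Everything in it is an assumption made for illustration: the dynamics in `step`, the rule in `clean_expert`, the uniform-noise model in `noisy_expert`, the majority-vote learner, and the DAgger-style aggregation of labels across rounds. It does not reproduce the exponential-versus-polynomial separation; it only shows that offline IL sees states from the noisy expert's trajectories, while the on-policy learner sees states from its own rollouts and asks the noisy expert for labels there.

```python
import random
from collections import Counter, defaultdict

# Toy problem sizes (hypothetical values chosen for illustration).
N_STATES, N_ACTIONS, HORIZON = 6, 3, 10

def clean_expert(state):
    # Hypothetical clean expert: a fixed deterministic rule.
    return state % N_ACTIONS

def noisy_expert(state, eps, rng):
    # Assumed noise model: with probability eps the label is replaced
    # by a uniformly random action; otherwise it is the clean action.
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    return clean_expert(state)

def step(state, action):
    # Toy deterministic dynamics: the next state depends on the action.
    return (2 * state + action + 1) % N_STATES

def rollout(policy):
    # States visited when following a deterministic policy from state 0.
    state, visited = 0, []
    for _ in range(HORIZON):
        visited.append(state)
        state = step(state, policy(state))
    return visited

def fit_majority(pairs):
    # Tabular learner: pick the most frequent label seen in each state.
    votes = defaultdict(Counter)
    for state, action in pairs:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

def offline_il(n_traj, eps, rng):
    # Offline: both the states and the labels come from trajectories
    # generated by the noisy expert acting on its own noisy actions.
    pairs = []
    for _ in range(n_traj):
        state = 0
        for _ in range(HORIZON):
            action = noisy_expert(state, eps, rng)
            pairs.append((state, action))
            state = step(state, action)
    return fit_majority(pairs)

def on_policy_distill(n_rounds, n_traj, eps, rng):
    # On-policy: states come from the learner's own rollouts; the noisy
    # expert is only queried for a label at each visited state.
    table, pairs = {}, []
    for _ in range(n_rounds):
        policy = lambda s: table.get(s, 0)  # default action 0 if unseen
        for _ in range(n_traj):
            for state in rollout(policy):
                pairs.append((state, noisy_expert(state, eps, rng)))
        table = fit_majority(pairs)  # refit on all labels gathered so far
    return table

if __name__ == "__main__":
    rng = random.Random(0)
    print("offline table:  ", offline_il(200, 0.2, rng))
    print("on-policy table:", on_policy_distill(5, 50, 0.2, rng))
```

In this sketch the on-policy learner repeatedly queries the noisy expert at the states it actually reaches, so label noise at those states can be averaged out, whereas the offline learner has no control over which states its fixed dataset covers.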

Why it matters

For professionals developing and training large language models or other AI systems via imitation learning, this research provides critical theoretical insights into why online learning strategies are often superior when dealing with imperfect expert data. It guides the choice of training paradigms for better model performance and efficiency.

How to implement this in your domain

  1. Re-evaluate current language model training pipelines, especially those relying solely on offline imitation learning, to consider integrating on-policy distillation.
  2. Assess the "noisiness" of expert data used for training and its potential impact on model performance.
  3. Experiment with different variants of on-policy distillation to optimize learning from imperfect teachers.
  4. Develop strategies for collecting or generating cleaner expert feedback to mitigate the challenges of noisy data.
  5. Consider the trade-offs between statistical efficiency and horizon dependence when choosing between offline and online IL methods.
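As a rough aid for steps 2 and 5, the short Python sketch below shows how quickly a small per-step noise rate compounds over a long horizon. It assumes, purely for illustration, that the noisy expert deviates from the clean expert independently at each step with probability `eps`; this simplified noise model and the function names are assumptions, not the paper's formal setup or its proof. Under that assumption, a trajectory of length `horizon` contains no deviation with probability (1 - eps)^horizon, which shrinks geometrically as the horizon grows.

```python
def prob_clean_trajectory(eps: float, horizon: int) -> float:
    # Probability that a noisy trajectory has no deviation at any step,
    # assuming independent per-step deviations with probability eps.
    return (1.0 - eps) ** horizon

def expected_trajectories_per_clean_one(eps: float, horizon: int) -> float:
    # Expected number of noisy trajectories drawn before seeing one
    # with no deviation (mean of a geometric distribution).
    return 1.0 / prob_clean_trajectory(eps, horizon)

if __name__ == "__main__":
    for horizon in (10, 100, 1000):
        p = prob_clean_trajectory(0.01, horizon)
        print(f"horizon={horizon}: P(no deviation)={p:.4g}")
```

Estimating `eps` for a real teacher would require some reference for what the clean expert would have done, which this sketch does not address.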

Who benefits

AI Development · EdTech · Software Development · Customer Service · Content Creation

Key takeaways

  • Online methods like on-policy distillation (OPD) are often superior to offline IL with noisy expert data.
  • Offline learning from noisy trajectories requires a number of samples that grows exponentially with the task horizon to compete with a clean expert.
  • OPD can achieve polynomial dependence on the task horizon.
  • The analysis gives a theoretical basis for why OPD is often more effective than supervised fine-tuning when training language models from imperfect teachers.

Original post by Ved Sriraman, Peihan Liu, Daniel Hsu, Adam Block

"arXiv:2606.30923v1 Announce Type: new Abstract: Imitation Learning is a natural framework for learning in sequential decision-making systems and has emerged as the dominant paradigm through which we understand language model training. A central puzzle is that, while in theory off…"
