ATOD Boosts Multi-Turn Agent Training with Hybrid Distillation

Qitai Tan, Zefang Zong, Yang Li, Peng Chen · June 29, 2026

Summary

This paper introduces ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm for training small language-model agents in long-horizon interactive tasks. ATOD combines an annealed OPD-RL schedule with Turn-level Disagreement-Uncertainty Reweighting (T-DUR) to achieve faster imitation and higher reward-driven improvement, outperforming existing baselines.

Training small language models to act as autonomous agents in multi-turn interactive tasks presents a dilemma. On-policy Distillation (OPD) learns quickly at first because the teacher provides dense guidance, but it soon plateaus. Reinforcement Learning (RL) can reach a higher performance ceiling but learns slowly because feedback is sparse. ATOD (Annealed Turn-aware On-policy Distillation) is a hybrid algorithm designed to combine the strengths of both. It follows an annealed schedule: OPD dominates early training so the student quickly mimics the teacher's behavior, and the RL contribution is then gradually increased to drive exploration and optimize for environment rewards. The agent thus benefits from dense supervision early on and can later push beyond the teacher's performance. ATOD also incorporates Turn-level Disagreement-Uncertainty Reweighting (T-DUR), which upweights high-utility turns within long trajectories so that supervision is concentrated where it is most useful. According to the authors, experiments across several benchmarks show that ATOD consistently surpasses both pure OPD and pure RL, and even outperforms the teacher models in success rate.
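To make the annealed schedule concrete, here is a minimal Python sketch. It is not the authors' implementation: the summary does not give the schedule's shape, so the cosine decay, the `w_start` and `w_end` values, and the function names `opd_weight` and `hybrid_loss` are all assumptions chosen for illustration.

```python
import math


def opd_weight(step: int, total_steps: int,
               w_start: float = 1.0, w_end: float = 0.1) -> float:
    """Weight on the distillation (OPD) term at a given training step.

    Assumption: a cosine decay from w_start to w_end. The paper may use
    a different schedule; only the "OPD first, RL later" idea is from the source.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def hybrid_loss(opd_loss: float, rl_loss: float,
                step: int, total_steps: int) -> float:
    """Blend the two objectives; the RL share grows as the OPD weight decays."""
    lam = opd_weight(step, total_steps)
    return lam * opd_loss + (1.0 - lam) * rl_loss


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(opd_weight(step, 1000), 3))
```

With these hypothetical defaults, the OPD term carries the full weight at step 0 and a small residual weight at the end, so the teacher signal never disappears entirely. Whether ATOD keeps such a residual is not stated in the summary.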

Why it matters

For professionals building conversational AI, virtual assistants, or autonomous agents, ATOD offers a more efficient and effective way to train smaller, high-performing models. This can lead to more capable agents with reduced computational costs and faster development cycles.

How to implement this in your domain

  1. Adopt a hybrid training approach for autonomous agents, combining imitation learning (like OPD) with reinforcement learning.
  2. Implement an annealed schedule that transitions from heavy imitation in early stages to increased reward-driven exploration later.
  3. Develop a mechanism like Turn-level Disagreement-Uncertainty Reweighting to prioritize high-impact turns for supervision.
  4. Evaluate the performance of smaller language models trained with ATOD against larger teacher models for cost-efficiency and deployment.
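The turn-level reweighting in the third step could be prototyped along the following lines. This is a sketch under stated assumptions, not the T-DUR formula from the paper, which the summary does not give. It assumes disagreement is the mean per-token KL divergence from student to teacher, uncertainty is the mean student entropy, the two are simply added, and a softmax over turns (rescaled to mean 1) turns the scores into weights. The names `turn_weights` and `temperature` are hypothetical.

```python
import math
from typing import List, Sequence

Dist = Sequence[float]  # one probability distribution over the vocabulary


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p: Dist, eps: float = 1e-12) -> float:
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def turn_weights(teacher_turns: List[List[Dist]],
                 student_turns: List[List[Dist]],
                 temperature: float = 1.0) -> List[float]:
    """Per-turn weights; each turn is a list of per-token distributions.

    Assumed scoring: mean KL(student || teacher) plus mean student entropy.
    """
    scores = []
    for t_dists, s_dists in zip(teacher_turns, student_turns):
        n = max(len(s_dists), 1)
        disagreement = sum(kl_divergence(s, t) for s, t in zip(s_dists, t_dists)) / n
        uncertainty = sum(entropy(s) for s in s_dists) / n
        scores.append(disagreement + uncertainty)
    if not scores:
        return []
    # Softmax over turns, rescaled so the weights average to 1.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


if __name__ == "__main__":
    teacher = [[[0.9, 0.1]], [[0.9, 0.1]]]
    student = [[[0.9, 0.1]], [[0.5, 0.5]]]  # second turn: unsure and off-teacher
    print(turn_weights(teacher, student))
```

The resulting weights would multiply each turn's distillation loss, so a turn where the student is both unsure and far from the teacher contributes more to the gradient. Rescaling to mean 1 keeps the overall loss magnitude comparable to the unweighted case.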

Who benefits

Customer Service, EdTech, Gaming, Virtual Assistants, Software Development

Key takeaways

  • ATOD combines on-policy distillation and reinforcement learning for efficient agent training.
  • An annealed schedule allows for fast imitation followed by reward-driven exploration.
  • Turn-level reweighting improves dense supervision in long, multi-turn interactions.
  • ATOD consistently outperforms baselines and even teacher models in success rate.

Original post by Qitai Tan, Zefang Zong, Yang Li, Peng Chen

"arXiv:2606.27814v1 Announce Type: new Abstract: Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the e…"

Originally posted by Qitai Tan, Zefang Zong, Yang Li, Peng Chen on X
