ATOD Boosts Multi-Turn Agent Training with Hybrid Distillation
Summary
This paper introduces ATOD (Annealed Turn-aware On-policy Distillation), a hybrid online distillation algorithm for training small language-model agents on long-horizon interactive tasks. ATOD combines an annealed schedule that moves from on-policy distillation (OPD) toward reinforcement learning (RL) with Turn-level Disagreement-Uncertainty Reweighting (T-DUR). The goal is fast early imitation followed by reward-driven improvement, and the method is reported to outperform existing baselines.
Why it matters
For professionals building conversational AI, virtual assistants, or autonomous agents, ATOD offers a more efficient and effective way to train smaller, high-performing models. This can lead to more capable agents with reduced computational costs and faster development cycles.
How to implement this in your domain
1. Adopt a hybrid training approach for autonomous agents, combining imitation learning (such as OPD) with reinforcement learning.
2. Implement an annealed schedule that transitions from heavy imitation in early stages to increased reward-driven exploration later.
3. Develop a mechanism like Turn-level Disagreement-Uncertainty Reweighting to prioritize high-impact turns for supervision.
4. Evaluate the performance of smaller language models trained with ATOD against larger teacher models for cost-efficiency and deployment.
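The annealed schedule described in the steps above can be illustrated with a small sketch. The summary does not give ATOD's actual schedule or loss, so everything here is an assumption made for illustration: the cosine shape, the `start` and `end` values, and the hypothetical function names `imitation_weight` and `hybrid_loss`. It only shows the general idea of weighting imitation heavily at first and shifting toward the reward-driven term later.

```python
import math


def imitation_weight(step, total_steps, start=1.0, end=0.1):
    """Cosine-anneal the imitation weight from `start` down to `end`.

    Assumption: a cosine curve is used purely for illustration; the
    source does not specify the shape of ATOD's schedule.
    """
    # Clamp progress to [0, 1] so steps past the horizon stay at `end`.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def hybrid_loss(distill_loss, rl_loss, step, total_steps):
    """Blend a distillation loss and an RL loss using the annealed weight.

    Both inputs are plain floats here; in a real trainer they would be
    scalar tensors produced by the respective objectives.
    """
    w = imitation_weight(step, total_steps)
    return w * distill_loss + (1.0 - w) * rl_loss


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, round(imitation_weight(step, 1000), 3))
```

Under these assumptions the weight starts at `start`, decreases monotonically, and settles at `end`, so the imitation term never vanishes entirely. Whether a floor of this kind is appropriate is a design choice to validate on your own task.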
Key takeaways
- ATOD combines on-policy distillation and reinforcement learning for efficient agent training.
- An annealed schedule allows for fast imitation followed by reward-driven exploration.
- Turn-level reweighting improves dense supervision in long, multi-turn interactions.
- ATOD consistently outperforms baselines and even teacher models in success rate.
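To make the turn-level reweighting takeaway concrete, here is a minimal sketch of one way per-turn weights could be computed. The summary does not state the T-DUR formula, so this is not the paper's method: the scoring rule (teacher-student KL divergence discounted by teacher entropy), the softmax normalization, and the hypothetical names `turn_weights`, `kl_divergence`, and `entropy` are all assumptions chosen for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def turn_weights(teacher_turns, student_turns, temperature=1.0):
    """Return one weight per turn; weights average to 1 across turns.

    teacher_turns[t] and student_turns[t] are lists of per-token
    probability distributions for turn t (each turn must be non-empty).
    Assumed scoring rule: turns where the student disagrees with a
    confident teacher get more weight than turns where the teacher
    itself is uncertain.
    """
    scores = []
    for t_dists, s_dists in zip(teacher_turns, student_turns):
        n = len(t_dists)
        disagreement = sum(kl_divergence(t, s) for t, s in zip(t_dists, s_dists)) / n
        uncertainty = sum(entropy(t) for t in t_dists) / n
        scores.append(disagreement / (1.0 + uncertainty))
    # Softmax over turns (shifted by the max for numerical stability),
    # then rescale so the mean weight is 1.
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


if __name__ == "__main__":
    teacher = [[[0.9, 0.1]], [[0.9, 0.1]]]
    student = [[[0.9, 0.1]], [[0.1, 0.9]]]  # student disagrees on turn 1
    print(turn_weights(teacher, student))
```

Multiplying each turn's distillation loss by its weight would then emphasize the turns where supervision is assumed to matter most, while keeping the overall loss scale roughly unchanged because the weights average to 1.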
Original post by Qitai Tan, Zefang Zong, Yang Li, Peng Chen
"arXiv:2606.27814v1. Abstract: Training small language-model agents for long-horizon interactive tasks requires both fast imitation and reward-driven improvement. On-policy distillation (OPD) provides dense teacher guidance and typically improves rapidly in the e…"