RODS Synthesizes Data for Efficient Multi-Turn Tool-Use AI Training

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin · June 18, 2026

Summary

Researchers propose RODS (Reward-driven Online Data Synthesis), a method that addresses the depletion of informative samples in multi-turn tool-use reinforcement learning. RODS uses reward variance to continuously identify samples near an agent's capability boundary, synthesizes new variants of matching structural complexity, and manages a dynamic replay buffer. The authors report performance comparable to training on much larger offline datasets while using significantly fewer trajectories.

The paper targets a bottleneck in training multi-turn tool-use reinforcement learning (RL) agents: static datasets run out of informative samples quickly, because the agent's "capability boundary" shifts as it improves. Tasks it always solves, or never solves, stop teaching it anything. The researchers observe that in GRPO the gradient signal concentrates on tasks with the highest rollout reward variance, that is, tasks where the agent is on the cusp of success or failure. RODS uses this variance as an indicator of "boundary samples". It comes at no extra cost because the rewards are already computed during training rollouts. Once a boundary sample is identified, RODS synthesizes new multi-turn task variants that match its structural complexity. Combined with a replay buffer that co-evolves with the policy, this online data generation lets RODS reach performance comparable to much larger offline datasets with significantly fewer training trajectories.
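To make the variance signal concrete, here is a minimal sketch, not the authors' code. It assumes binary success rewards and a group of rollouts per task; the function names, the task ids, and the 0.2 threshold are all hypothetical. With binary rewards the variance is p(1 - p) for success rate p, so it is zero for tasks that are always or never solved and peaks at 0.25 when the agent succeeds half the time.

```python
from statistics import pvariance


def reward_variance(rewards):
    """Population variance of the rewards from one task's rollout group."""
    return pvariance(rewards)


def select_boundary_tasks(rollout_rewards, min_variance=0.2):
    """Return ids of tasks whose rollout rewards vary enough to be informative.

    min_variance is an illustrative threshold, not a value from the paper.
    """
    return [
        task_id
        for task_id, rewards in rollout_rewards.items()
        if len(rewards) > 1 and reward_variance(rewards) >= min_variance
    ]


# Hypothetical rewards from four rollouts per task (1.0 = success).
rollout_rewards = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],  # variance 0.0
    "never_solved": [0.0, 0.0, 0.0, 0.0],   # variance 0.0
    "boundary": [1.0, 0.0, 1.0, 0.0],       # variance 0.25
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],  # variance 0.1875
}

print(select_boundary_tasks(rollout_rewards))  # ['boundary']
```

Because the rewards are already collected for the policy update, this selection step adds no extra rollouts.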

Why it matters

For developers building agents that call tools or APIs over multiple turns, this work points to a more sample-efficient way to train them: generate new training tasks where the agent is currently learning instead of collecting ever larger static datasets.

How to implement this in your domain

  1. Implement reward variance as a metric to identify informative samples in RL training for tool-use agents.
  2. Develop a data synthesis pipeline to generate new training examples based on the structural complexity of boundary samples.
  3. Integrate a dynamic replay buffer that adapts and co-evolves with the agent's policy during training.
  4. Apply RODS principles to reduce the reliance on large, static datasets for multi-turn agent training.
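The steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the RODS implementation: `Task`, `BoundaryReplayBuffer`, and `synthesize_variant` are hypothetical names, turn count stands in for "structural complexity", the thresholds are illustrative, and the synthesis step is a placeholder where a real pipeline would generate a genuinely new task.

```python
import random
from dataclasses import dataclass
from statistics import pvariance


@dataclass
class Task:
    prompt: str
    num_turns: int         # stand-in for structural complexity (assumption)
    variance: float = 0.0  # reward variance from the latest rollouts


class BoundaryReplayBuffer:
    """Keeps only tasks whose latest rollout rewards still vary."""

    def __init__(self, capacity, min_variance):
        self.capacity = capacity
        self.min_variance = min_variance
        self.tasks = []

    def update(self, task, rewards):
        """Record fresh rewards; return True if the task is still informative."""
        task.variance = pvariance(rewards)
        # Drop any stale entry for this task before deciding whether to keep it.
        self.tasks = [t for t in self.tasks if t is not task]
        informative = task.variance >= self.min_variance
        if informative:
            self.tasks.append(task)
            # Over capacity: evict the lowest-variance tasks first.
            self.tasks.sort(key=lambda t: t.variance, reverse=True)
            del self.tasks[self.capacity:]
        return informative

    def sample(self, k, rng=random):
        return rng.sample(self.tasks, min(k, len(self.tasks)))


def synthesize_variant(task, index):
    """Placeholder synthesis: copies the turn count and tags the prompt."""
    return Task(prompt=f"{task.prompt} [variant {index}]", num_turns=task.num_turns)


buffer = BoundaryReplayBuffer(capacity=3, min_variance=0.2)
seed = Task("book a flight, then a hotel", num_turns=4)

# The seed task is at the boundary, so synthesize variants of matching complexity.
if buffer.update(seed, [1.0, 0.0, 0.0, 1.0]):
    for i in range(2):
        variant = synthesize_variant(seed, i)
        buffer.update(variant, [1.0, 0.0, 1.0, 0.0])  # pretend rollout rewards

# Later the agent masters the seed task, so the buffer evicts it.
buffer.update(seed, [1.0, 1.0, 1.0, 1.0])
print([t.prompt for t in buffer.tasks])
```

Re-scoring tasks on every update is what makes the buffer follow the policy: a task leaves as soon as its rewards stop varying, whether the agent has mastered it or can no longer solve it.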

Who benefits

AI/ML Engineering · Software Development · Business Process Automation · Robotics · Customer Service

Key takeaways

  • RODS addresses informative sample depletion in multi-turn tool-use RL.
  • It uses reward variance to identify critical "boundary samples."
  • New data variants are synthesized online, matching structural complexity.
  • RODS significantly reduces the number of trajectories needed for training.

Original post by Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

"arXiv:2606.19047v1 Announce Type: new Abstract: Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of th…"

