New Critic-Free RL Method Improves Data Efficiency and Stability

Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie · June 17, 2026

Summary

This research introduces negative token filtering, a novel strategy for critic-free reinforcement learning that enables stable single-rollout training. It addresses data inefficiency and synchronization issues found in traditional group-based RL methods, achieving comparable or stronger performance on reasoning and agentic tasks.

Critic-free reinforcement learning (RL) methods typically generate a group of rollouts for the same query to estimate value baselines. This approach is data-inefficient, introduces synchronization barriers, and is inflexible when rollouts are structured. A new study re-examines what these "groups" actually do and concludes that their primary role is to prevent incorrect penalties on negative samples, not only to estimate baselines. Building on this insight, the researchers propose "negative token filtering," which allows stable training from a single rollout per query and so improves data efficiency. Applied to existing batch-level advantage methods, it performs on par with or better than group-based RL techniques on reasoning and agentic tasks.
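To make the contrast concrete, here is a minimal Python sketch of the two baseline styles, plus a placeholder for the filtering step. It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, and the rule for deciding which tokens of a negative sample to keep is not described in this summary, so it is passed in as an opaque list of flags.

```python
from statistics import mean


def group_relative_advantages(rewards_by_query):
    """Group-based baseline: several rollouts per query, and each
    query's baseline is the mean reward of its own group."""
    advantages = {}
    for query, rewards in rewards_by_query.items():
        baseline = mean(rewards)
        advantages[query] = [r - baseline for r in rewards]
    return advantages


def batch_level_advantages(rewards):
    """Single-rollout setting: one reward per query, and the baseline
    is the mean reward over the whole batch."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def filtered_token_weights(advantage, keep_flags):
    """Hypothetical negative token filter (an assumption, not the paper's rule).

    keep_flags[i] says whether token i stays in the loss when the sample's
    advantage is negative. Samples with non-negative advantage keep every token.
    """
    if advantage >= 0:
        return [advantage] * len(keep_flags)
    return [advantage if keep else 0.0 for keep in keep_flags]


# A query where every rollout fails gets zero advantage under a group baseline...
print(group_relative_advantages({"hard_query": [0.0, 0.0, 0.0, 0.0]}))

# ...but the same failure is penalized under a batch-level baseline.
print(batch_level_advantages([1.0, 0.0, 1.0, 0.0]))

# The placeholder filter zeroes out the dropped tokens of a negative sample.
print(filtered_token_weights(-0.5, [True, False, True]))
```

One way to read the study's claim is visible in the first two calls: with a group baseline, a uniformly failed query contributes no penalty, whereas a batch-level baseline penalizes it. The filter in the sketch stands in for whatever criterion the paper uses to limit such penalties at the token level.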

Why it matters

Practitioners who post-train large language models with reinforcement learning could use this method to build more data-efficient and stable training pipelines, potentially reducing compute costs and shortening development cycles.

How to implement this in your domain

  1. Investigate integrating negative token filtering into existing critic-free RL frameworks for LLM post-training.
  2. Benchmark the performance and data efficiency of single-rollout training against current group-based methods.
  3. Adapt the technique for specific agentic or reasoning tasks to evaluate its impact on model capabilities.
  4. Explore how this method could simplify distributed RL training by removing group synchronization barriers.
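When setting up the benchmark in step 2, it helps to compare both approaches under the same rollout budget. The sketch below is simple bookkeeping, not a result from the paper. The helper name, the budget of 1024 rollouts, and the group size of 8 are all illustrative assumptions.

```python
def unique_queries_covered(rollout_budget, rollouts_per_query):
    """Number of distinct queries a fixed rollout budget can cover
    when each query consumes `rollouts_per_query` rollouts."""
    if rollouts_per_query < 1:
        raise ValueError("rollouts_per_query must be at least 1")
    return rollout_budget // rollouts_per_query


budget = 1024  # assumed rollout budget per training step
group_based = unique_queries_covered(budget, rollouts_per_query=8)  # assumed group size
single_rollout = unique_queries_covered(budget, rollouts_per_query=1)

print(f"group-based: {group_based} queries, single-rollout: {single_rollout} queries")
```

Holding the budget fixed keeps the comparison about how rollouts are spent rather than how many are generated.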

Who benefits

AI Development, Software Engineering, Research & Development, Robotics

Key takeaways

  • Traditional group-based critic-free RL methods face data inefficiency and synchronization issues.
  • The core function of "groups" is to prevent false penalties on negative samples.
  • Negative token filtering enables stable single-rollout training, improving efficiency.
  • This new method performs comparably to or better than group-based techniques on reasoning and agentic tasks.

Original post by Yihong Wu, Liheng Ma, Lingfeng Xiao, Muzhi Li, Xinyu Wang, Yingxue Zhang, Jian-Yun Nie

"arXiv:2606.17250v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantag…"
