BV-Blend Stabilizes Critic-Free Reinforcement Learning for LLMs

Yupeng Chang, Yuan Wu, Yi Chang · June 30, 2026

Summary

BV-Blend is a critic-free reinforcement learning framework for aligning large language models. It improves training stability and performance by blending prompt-local statistics with historical, cluster-conditioned reward moments, which keeps advantage estimates from collapsing to zero in cold-start regimes.

This paper introduces BV-Blend, a framework designed to stabilize critic-free reinforcement learning (RL) for large language models (LLMs), particularly when rewards are verifiable. Critic-free methods such as Group Relative Policy Optimization (GRPO) reduce computational overhead by avoiding a value function, but they become unstable when the rewards within a prompt's sampled group are uniform: every advantage estimate is then zero and learning stalls. BV-Blend addresses this by combining immediate, prompt-local on-policy statistics with historical reward moments tracked per semantic cluster. A confidence weight, derived from a standard-error-of-the-mean proxy, controls how much the historical and local baselines and variance statistics each contribute. The blended statistics yield a standardized advantage for PPO-style updates that stays informative even when the local group carries no signal. According to the authors, experiments on verifiable reasoning benchmarks show improved training stability and overall performance, including in settings where other group-normalized methods can fail.
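The zero-advantage failure and the blending idea can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not give the paper's exact formulas, so the weight below is an assumed inverse-variance form in which both standard errors share one noise scale, and the names `grpo_advantage`, `blended_advantage`, `hist_mean`, `hist_var`, and `hist_count` are hypothetical.

```python
import math
import numpy as np

def grpo_advantage(rewards, eps=1e-6):
    """Group-normalized advantage: standardize rewards within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def blended_advantage(rewards, hist_mean, hist_var, hist_count, eps=1e-6):
    """Standardize rewards with a blend of local and historical moments.

    ASSUMPTION: the confidence weight is an inverse-variance weight built
    from squared standard errors (sigma^2 / count) that share one noise
    scale. The paper's actual SEM proxy may differ.
    """
    r = np.asarray(rewards, dtype=float)
    n = r.size
    local_mean, local_var = r.mean(), r.var()

    sigma2 = max(hist_var, eps)                    # shared noise scale (assumed)
    sem2_local = sigma2 / n                        # squared SEM of the local mean
    sem2_hist = sigma2 / hist_count if hist_count > 0 else math.inf
    w_hist = sem2_local / (sem2_local + sem2_hist) # trust history more as it accumulates

    mean = w_hist * hist_mean + (1.0 - w_hist) * local_mean
    var = w_hist * hist_var + (1.0 - w_hist) * local_var
    return (r - mean) / (math.sqrt(var) + eps)

# All four sampled responses were correct, so the local group is uniform.
uniform = [1.0, 1.0, 1.0, 1.0]
print(grpo_advantage(uniform))   # all zeros: no learning signal
print(blended_advantage(uniform, hist_mean=0.5, hist_var=0.25, hist_count=12))
```

With no history (`hist_count=0`) the weight on the historical moments is zero and the sketch reduces to the group-normalized advantage, so the blend only changes behavior once a cluster has accumulated reward observations.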

Why it matters

For professionals developing or deploying LLMs, BV-Blend offers a more stable and efficient method for aligning models with desired behaviors, especially in critical applications requiring verifiable rewards, potentially reducing training costs and improving model reliability.

How to implement this in your domain

  1. Investigate integrating BV-Blend's advantage estimation technique into your existing critic-free RL pipelines for LLM alignment.
  2. Experiment with semantic clustering of prompts to leverage historical reward moments effectively in your training data.
  3. Evaluate the stability and performance gains of BV-Blend compared to current GRPO-style methods on your specific LLM alignment tasks.
  4. Consider applying the uncertainty-weighted blending approach to other areas of RL where baseline estimation is a challenge.
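As a starting point for the clustering step, the historical reward moments can be kept as running statistics per cluster. The sketch below uses Welford's online algorithm, which is a swapped-in standard technique and not something the summary attributes to the paper. The class `ClusterRewardMoments` and the string cluster ids are hypothetical; how prompts are assigned to semantic clusters (for example, by clustering prompt embeddings) is left to the caller.

```python
from collections import defaultdict

class ClusterRewardMoments:
    """Running count, mean, and variance of rewards per cluster (Welford's method)."""

    def __init__(self):
        # cluster_id -> [count, mean, M2], where M2 is the sum of squared deviations
        self._stats = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, cluster_id, reward):
        """Fold one observed reward into the cluster's running moments."""
        s = self._stats[cluster_id]
        s[0] += 1
        delta = reward - s[1]
        s[1] += delta / s[0]
        s[2] += delta * (reward - s[1])

    def get(self, cluster_id):
        """Return (count, mean, population variance); zeros for an unseen cluster."""
        count, mean, m2 = self._stats.get(cluster_id, (0, 0.0, 0.0))
        var = m2 / count if count > 0 else 0.0
        return count, mean, var

# Hypothetical usage: cluster ids come from whatever prompt clustering you choose.
moments = ClusterRewardMoments()
for reward in [1.0, 0.0, 1.0, 1.0]:
    moments.update("algebra", reward)
print(moments.get("algebra"))
print(moments.get("geometry"))  # unseen cluster: no history yet
```

An unseen cluster returns a count of zero, which lets downstream code fall back to purely local statistics until history accumulates.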

Who benefits

AI Development, Software Engineering, Research & Development, Customer Service

Key takeaways

  • BV-Blend stabilizes critic-free RL for LLMs by blending local and historical reward statistics.
  • It addresses instability issues like zero advantage in cold-start scenarios.
  • The framework uses uncertainty weighting for robust advantage estimation.
  • BV-Blend improves training stability and performance in verifiable reasoning tasks.

Original post by Yupeng Chang, Yuan Wu, Yi Chang

"arXiv:2606.28707v1 Announce Type: new Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR), exemplified by Group Relative Policy Optimization (GRPO), avoids training a value function (critic) and reduces memory and compute overhead relative to critic-based…"
