New Method Addresses Diversity Collapse in LLM Reinforcement Learning

Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An · June 16, 2026

Summary

Research formalizes diversity collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs as overtraining and introduces Bayesian Boundary Gating (BBG) to improve reasoning diversity across various benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent technique for enhancing the reasoning capabilities of large language models (LLMs). However, a significant challenge known as "diversity collapse" often arises: Pass@1 (single-shot success) improves while Pass@k for higher values of k degrades, suggesting a narrowing of the model's overall reasoning boundary.

The paper formalizes diversity collapse through the lens of overtraining. It posits that once a problem's contribution to the reference metric has effectively saturated, further updates, especially with limited rollouts per problem, no longer expand the model's ability to solve new problems. Instead, these updates concentrate probability mass on trajectories already favored by on-policy sampling, which structurally biases training against high-k Pass@k.

To mitigate this, the researchers propose Bayesian Boundary Gating (BBG). The method redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Empirical results show that restricting updates to problems with zero observed success can lift Pass@256 above the base model on difficult benchmarks. Building on these insights, BBG significantly improves average Pass@k across a wide range of k values on multiple reasoning benchmarks, demonstrating its effectiveness in fostering greater reasoning diversity.
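The gap between Pass@1 and high-k Pass@k can be made concrete with the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn for a problem and c is the number that pass the verifier. The sketch below is illustrative only: the function names and the per-problem counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # of the n samples contains no correct sample (comb returns 0 if k > n - c).
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[int], n: int, k: int) -> float:
    """Average Pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in counts) / len(counts)


# Hypothetical counts out of n = 256 samples per problem (made up for illustration).
N = 256
base_counts = [26, 26, 26, 26]    # every problem is solved occasionally
tuned_counts = [256, 256, 0, 0]   # two problems always solved, two never

for k in (1, 16, 256):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k:3d}  base={base:.3f}  tuned={tuned:.3f}")
```

In this toy setup the "tuned" counts score higher at k = 1 but lower at k = 256, which is the pattern the summary calls diversity collapse.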

Why it matters

Addressing diversity collapse is critical for developing more robust, versatile, and reliable LLMs that can explore a broader spectrum of reasoning paths, which is essential for complex problem-solving and real-world applications.

How to implement this in your domain

1. Analyze existing RLVR training pipelines for LLMs to identify instances of diversity collapse in reasoning tasks.
2. Implement strategies to selectively update models, focusing on problems that have not yet been successfully solved.
3. Explore integrating Bayesian Boundary Gating (BBG) into your LLM fine-tuning and reinforcement learning workflows.
4. Monitor and evaluate LLM performance using a wide range of Pass@k metrics to assess reasoning diversity comprehensively.
5. Adapt training methodologies to prioritize the expansion of the model's reasoning boundary over mere optimization for single-shot success.
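Steps 2 and 3 could be prototyped along the following lines. This is a minimal sketch under stated assumptions, not the paper's BBG rule, whose details are not given in this summary. The first function keeps only problems with zero observed success, as the summary describes. The second is a hypothetical Bayesian stand-in: it places a Beta prior on each problem's per-sample success rate and keeps a problem while the posterior expectation that k fresh samples all fail stays above a threshold. All names, the prior, k, and the threshold are illustrative choices.

```python
from math import exp, lgamma


def zero_success_mask(success_counts: list[int]) -> list[bool]:
    """True for problems with no successful rollout observed so far."""
    return [c == 0 for c in success_counts]


def expected_unsolved_prob(c: int, n: int, k: int,
                           a0: float = 1.0, b0: float = 1.0) -> float:
    """Posterior mean of (1 - p)^k for per-sample success rate p.

    Assumes a Beta(a0, b0) prior on p and c successes in n rollouts
    (a hypothetical modelling choice, not the paper's estimator).
    """
    a, b = a0 + c, b0 + (n - c)
    # E[(1 - p)^k] = B(a, b + k) / B(a, b)
    #              = Gamma(b + k) Gamma(a + b) / (Gamma(b) Gamma(a + b + k))
    return exp(lgamma(b + k) + lgamma(a + b) - lgamma(b) - lgamma(a + b + k))


def bayesian_gate(success_counts: list[int], n: int,
                  k: int = 64, threshold: float = 0.01) -> list[bool]:
    """Keep problems that are still likely to be unsolved within k samples."""
    return [expected_unsolved_prob(c, n, k) >= threshold for c in success_counts]


# Hypothetical success counts out of n = 8 rollouts per problem.
counts = [0, 1, 4, 8]
print(zero_success_mask(counts))
print(bayesian_gate(counts, n=8))
```

In a training loop, either mask would be used to drop the gated-out problems from the batch before the policy update. The Bayesian variant is softer than the zero-success rule: a problem solved once in a handful of rollouts may still be kept, since a single success is weak evidence that the problem is reliably inside the model's reasoning boundary.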

Who benefits

AI Development, Software Engineering, Research & Development, Education

Key takeaways

The paper frames diversity collapse in RLVR for LLMs as a form of overtraining.
Overtraining concentrates probability mass on already-favored trajectories, narrowing the model's reasoning boundary.
Restricting updates to problems with zero observed success can lift Pass@256 above the base model on difficult benchmarks.
Bayesian Boundary Gating (BBG) is a new method to mitigate diversity collapse and enhance LLM reasoning.

Original post by Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An

"arXiv:2606.15455v1. Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from diversity collapse: Pass@1 improves while high…"
