New Method Improves RLHF Stability with Uncertainty-Aware Reward Models

Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang · June 19, 2026

Summary

A new approach called Uncertainty-Aware Reward Modeling (UARM) enhances Reinforcement Learning from Human Feedback (RLHF) by equipping reward models with calibrated uncertainty estimates and reweighting policy optimization advantages. This mitigates reward hacking and improves alignment quality by preventing unreliable reward signals from disproportionately influencing policy updates.

Reinforcement Learning from Human Feedback (RLHF) is a critical technique for aligning large language models (LLMs) with human preferences. The process typically involves training reward models on human preference data and then optimizing LLM policies to maximize the predicted rewards.

This pipeline faces two significant challenges that can lead to instability. First, standard reward models often act as deterministic point estimators, unable to signal when their predictions are unreliable. Second, modern group-based policy optimization methods, such as GRPO, treat all reward signals uniformly during advantage computation, potentially amplifying unreliable estimates. As LLM policies explore a wider range of responses, inaccurate reward estimates can exert undue influence on policy updates, leading to reward hacking: the model learns to exploit flaws in the reward function rather than genuinely aligning with human intent.

To address these issues, the researchers propose Uncertainty-Aware Reward Modeling (UARM). UARM equips reward models with calibrated uncertainty estimates through quantile-based conformal prediction, and it reweights GRPO advantages using heteroscedastic variance decomposition. Experiments on datasets like HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM substantially improves reward model calibration, reduces reward hacking, and enhances the quality of downstream alignment compared to standard GRPO and other uncertainty-agnostic baselines.
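
To make the calibration idea concrete, here is a minimal sketch of conformalized quantile regression, a standard form of quantile-based conformal prediction. The summary does not spell out UARM's exact procedure, so this is an illustration rather than the authors' method. The function names (`conformal_offset`, `calibrated_interval`) and the assumption that the reward model outputs a lower and an upper quantile per response are hypothetical.

```python
import numpy as np

def conformal_offset(q_lo, q_hi, y, alpha=0.1):
    """Offset that widens (or shrinks) predicted quantile intervals.

    q_lo, q_hi: lower/upper reward quantiles predicted on a held-out
                calibration set (hypothetical reward-model outputs).
    y:          reference rewards for the same calibration examples.
    alpha:      target miscoverage rate (0.1 targets 90% coverage).
    """
    q_lo, q_hi, y = (np.asarray(a, dtype=float) for a in (q_lo, q_hi, y))
    # Conformity score: how far y falls outside [q_lo, q_hi]
    # (negative when y lies inside the interval).
    scores = np.maximum(q_lo - y, y - q_hi)
    n = scores.size
    # Finite-sample corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    # method="higher" needs NumPy >= 1.22.
    return float(np.quantile(scores, level, method="higher"))

def calibrated_interval(q_lo, q_hi, offset):
    """Apply the calibration offset to new quantile predictions."""
    return np.asarray(q_lo, dtype=float) - offset, np.asarray(q_hi, dtype=float) + offset

# Usage with toy numbers: nine calibration points, one of them outside
# the predicted interval [0, 1].
offset = conformal_offset(np.zeros(9), np.ones(9), [0.5] * 8 + [1.3], alpha=0.1)
lo, hi = calibrated_interval([0.0], [1.0], offset)
```

The width of the calibrated interval, `hi - lo`, can then serve as a per-response uncertainty signal: wide intervals flag rewards the model is less sure about.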

Why it matters

AI developers and researchers building and deploying large language models can use UARM to create more robust and trustworthy AI systems, reducing the risk of reward hacking and ensuring better alignment with human values and intentions. This is crucial for reliable and safe AI applications.

How to implement this in your domain

  1. Adopt Uncertainty-Aware Reward Modeling (UARM) in RLHF pipelines for training large language models.
  2. Implement quantile-based conformal prediction to equip reward models with calibrated uncertainty estimates.
  3. Modify policy optimization algorithms like GRPO to incorporate heteroscedastic variance decomposition for reweighting advantages.
  4. Benchmark UARM's performance against existing RLHF methods on custom datasets to validate improved alignment and reduced reward hacking.
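
As a rough illustration of step 3, the sketch below down-weights group-normalized advantages for responses whose rewards are uncertain. The summary does not give UARM's actual reweighting formula, so the shrinkage rule here is an assumption: it splits the observed reward variance within a group into a signal part and a per-response noise part, then scales each advantage by signal / (signal + noise). The function names and the `sigmas` input (one uncertainty estimate per response) are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard group-normalized advantages: (r - mean) / std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def uncertainty_weighted_advantages(rewards, sigmas, eps=1e-8):
    """Shrink advantages of responses with noisy reward estimates.

    ASSUMPTION: this shrinkage rule is illustrative, not the paper's.
    rewards: predicted rewards for one group of sampled responses.
    sigmas:  per-response reward uncertainty (standard deviations).
    """
    r = np.asarray(rewards, dtype=float)
    var_noise = np.asarray(sigmas, dtype=float) ** 2
    # Total variance = signal variance + average noise variance,
    # so the signal part is estimated by subtraction, floored at zero.
    signal_var = max(r.var() - var_noise.mean(), 0.0)
    # Per-response weight in [0, 1]: near 1 for reliable rewards,
    # near 0 when noise dominates.
    weights = signal_var / (signal_var + var_noise + eps)
    return weights * grpo_advantages(r, eps)

# Usage: the last response has the highest reward but also the
# largest uncertainty, so its advantage is shrunk the most.
rewards = [1.0, 2.0, 3.0, 4.0]
sigmas = [0.1, 0.1, 0.1, 2.0]
plain = grpo_advantages(rewards)
weighted = uncertainty_weighted_advantages(rewards, sigmas)
```

With all uncertainties at zero, the weighted advantages reduce to the plain GRPO ones, which makes this variant easy to benchmark against the baseline in step 4.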

Who benefits

AI Development, Software Development, Customer Service, Content Creation, Education

Key takeaways

  • UARM improves RLHF by providing reward models with calibrated uncertainty estimates.
  • It reweights policy optimization advantages, preventing unreliable signals from causing reward hacking.
  • The method significantly enhances reward model calibration and downstream alignment quality.
  • UARM contributes to building more stable and trustworthy large language models.

Original post by Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang

"arXiv:2606.19818v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challeng…"
