New Method Calibrates RLHF Reward Models with Per-Rater Shrinkage

Arnav Raj · June 29, 2026

Summary

This research introduces PEBS, a post-hoc empirical-Bayes shrinkage estimator that fits a separate affine calibrator for each human annotator on top of an RLHF reward model. It targets a specific failure: a single global calibration cannot account for differences in how individual raters use the rating scale. Correcting for those differences lowers within-user held-out RMSE.

Reinforcement Learning from Human Feedback (RLHF) relies on reward models trained from human preferences. These models commonly apply a single, global calibration across all annotators, which ignores systematic differences in how individual raters use rating scales. The result is inaccurate for everyone, because an average fit does not describe any individual annotator's behavior.

A new method, PEBS (Per-rater Empirical-Bayes Shrinkage), addresses this with a post-hoc estimator. It fits a separate affine calibrator (a slope and an offset) for each annotator using a small, held-out portion of that annotator's ratings. Each calibrator is then refined with Morris-James-Stein empirical-Bayes shrinkage, which pulls it toward the population mean. The core reward model is not retrained.

In evaluations on datasets such as PRISM and PluriHarms, PEBS reduced within-user held-out RMSE by more than 8.5% compared with the standard population-slope baseline. The estimator is closed-form, so annotator-specific calibration can be added to an existing RLHF pipeline without an extra training loop.
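To make the shrinkage step concrete, here is the generic normal-normal empirical-Bayes form written out. It is a sketch of the general idea rather than the paper's exact estimator, which the summary does not reproduce. All symbols are notation assumed for this illustration: r_i is the reward model's score for item i, (a_u, b_u) are rater u's slope and offset, theta_u stands for either coefficient, s_u^2 is its sampling variance, and tau^2 is the estimated between-rater variance.

```latex
\hat{y}_{u,i} = a_u \, r_i + b_u, \qquad
\tilde{\theta}_u = \bar{\theta} + (1 - B_u)\,\bigl(\hat{\theta}_u - \bar{\theta}\bigr), \qquad
B_u = \frac{s_u^2}{s_u^2 + \hat{\tau}^2}
```

When a rater has only a few held-out ratings, s_u^2 is large, B_u approaches 1, and the rater's calibrator falls back toward the population mean. With many ratings, B_u approaches 0 and the rater's own fit is kept almost unchanged.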

Why it matters

Teams building or deploying RLHF systems can get more accurate reward-model calibration by accounting for how individual annotators use the rating scale, which in turn may improve the models trained against those rewards.

How to implement this in your domain

  1. Integrate PEBS as a post-processing step for existing RLHF reward models to refine calibration.
  2. Allocate a small held-out dataset for each annotator to fit their specific affine calibrators.
  3. Apply empirical-Bayes shrinkage to individual calibrators to balance personalization with overall population trends.
  4. Monitor the reduction in within-user RMSE to quantify the improvement in reward model accuracy.
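The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the PEBS implementation. It assumes ordinary least squares for the per-rater affine fit, a simple method-of-moments shrinkage weight in place of the paper's Morris-James-Stein estimator, and a pooled affine fit as a stand-in for the population baseline. The function names (`fit_affine`, `shrink_to_population`, `rmse`, `demo`) and all simulation settings are hypothetical.

```python
import numpy as np


def fit_affine(x, y):
    """Least-squares fit of y ~ slope * x + intercept.

    Returns [slope, intercept] and the sampling variance of each coefficient.
    """
    X = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / max(len(y) - 2, 1)
    var = sigma2 * np.diag(np.linalg.pinv(X.T @ X))
    return coef, var


def shrink_to_population(coefs, variances):
    """Pull each rater's coefficients toward the across-rater mean.

    coefs and variances have shape (n_raters, 2). Assumed shrinkage rule:
    method-of-moments between-rater variance, floored at zero.
    """
    mean = coefs.mean(axis=0)
    tau2 = np.maximum(coefs.var(axis=0, ddof=1) - variances.mean(axis=0), 0.0)
    keep = tau2 / (tau2 + variances + 1e-12)  # near 1 keeps the rater fit, 0 uses the mean
    return mean + keep * (coefs - mean)


def rmse(pred, target):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def demo(n_raters=200, n_calib=8, n_test=50, seed=0):
    """Compare mean within-user held-out RMSE on simulated raters."""
    rng = np.random.default_rng(seed)
    slopes = rng.normal(1.0, 0.3, n_raters)   # each rater's true scale use
    offsets = rng.normal(0.0, 0.5, n_raters)  # each rater's true offset
    calib, test, raw, var = [], [], [], []
    for a, b in zip(slopes, offsets):
        x = rng.normal(size=n_calib + n_test)                # reward-model scores
        y = a * x + b + rng.normal(scale=0.5, size=x.size)   # this rater's ratings
        coef, v = fit_affine(x[:n_calib], y[:n_calib])       # step 2: held-out fit
        calib.append((x[:n_calib], y[:n_calib]))
        test.append((x[n_calib:], y[n_calib:]))
        raw.append(coef)
        var.append(v)
    raw, var = np.array(raw), np.array(var)
    shrunk = shrink_to_population(raw, var)                  # step 3: shrinkage

    # Baseline: one affine calibrator fitted on every rater's calibration data.
    pooled, _ = fit_affine(np.concatenate([c[0] for c in calib]),
                           np.concatenate([c[1] for c in calib]))

    def mean_rmse(coefs):
        # Step 4: RMSE is computed per rater, then averaged across raters.
        return float(np.mean([rmse(c[0] * x + c[1], y)
                              for c, (x, y) in zip(coefs, test)]))

    return {
        "global": mean_rmse([pooled] * n_raters),
        "per_rater_raw": mean_rmse(raw),
        "shrunk": mean_rmse(shrunk),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3f}")
```

In a real pipeline, `x` would be the frozen reward model's scores and `y` each annotator's held-out ratings. The reward model itself is never refitted, which matches the post-hoc framing in the summary.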

Who benefits

AI Development · Content Moderation · Customer Service · Autonomous Systems

Key takeaways

  • RLHF reward models often suffer from inaccurate global calibration due to individual rater differences.
  • PEBS introduces a per-rater empirical-Bayes shrinkage method to calibrate reward models post-hoc.
  • This approach significantly reduces within-user RMSE without requiring reward model retraining.
  • Improved calibration leads to more reliable and accurate AI systems trained with human feedback.

Original post by Arnav Raj

"arXiv:2606.27578v1 Announce Type: new Abstract: Reward models for Reinforcement Learning from Human Feedback (RLHF) pool preferences across thousands of annotators and fit one global affine calibrator, collapsing raters with systematically different rating-scale offsets and slope…"

Originally posted by Arnav Raj on X