C3RL Improves LLM Confidence Calibration for Adaptive Scaling

Xuqing Yang, Yi Yuan, Shanzhe Lei, Xuhong Wang · July 3, 2026

Summary

C3RL, a novel reinforcement learning algorithm, enhances large language model (LLM) calibration by integrating correctness, calibration, and dataset-informed reference accuracy rewards. This leads to better-calibrated confidence without sacrificing accuracy, enabling an adaptive test-time scaling strategy (CAS) that reduces inference budget by up to 12.33 times.

This research addresses a significant issue in large language models (LLMs) trained with reinforcement learning (RL): RL improves task performance, but it often leaves the model's expressed confidence poorly aligned with its actual accuracy. The result can be overconfident hallucinations on questions where the model is in fact uncertain.

To counter this, the authors propose Correctness and Confidence Calibration Reinforcement Learning (C3RL), an RL algorithm whose reward combines three signals: response correctness, accurate confidence expression, and adherence to a dataset-informed reference accuracy. Evaluations across eight text and multimodal datasets show that C3RL significantly improves calibration without compromising accuracy, outperforming existing state-of-the-art methods on both.

Building on the well-calibrated verbalized confidence that C3RL produces, the researchers also introduce Confidence-based Adaptive Test Time Scaling (CAS), an inference-time strategy that allocates compute according to the model's confidence in its response. In their experiments, CAS surpasses majority voting in performance while reducing the inference budget by up to 12.33 times, pointing toward more reliable and resource-efficient LLM deployments.
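The summary does not spell out how CAS decides when to spend more compute, so the following is only a minimal sketch of the general idea, under assumed rules: accept the first response if its verbalized confidence clears a threshold, otherwise draw the rest of a fixed budget and take a confidence-weighted vote. The names `confidence_adaptive_answer`, `make_scripted_sampler`, `threshold`, and `max_samples` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# A sampler returns (answer, verbalized confidence in [0, 1]) for a prompt.
Sampler = Callable[[str], Tuple[str, float]]


def confidence_adaptive_answer(
    prompt: str,
    sample: Sampler,
    threshold: float = 0.8,   # assumed stopping rule, not from the paper
    max_samples: int = 8,     # assumed upper bound on the sampling budget
) -> Tuple[str, int]:
    """Return (chosen answer, number of samples used)."""
    answer, confidence = sample(prompt)
    # Confident first response: stop early and spend one sample only.
    if confidence >= threshold or max_samples <= 1:
        return answer, 1

    # Otherwise spend the remaining budget and weight votes by confidence.
    votes: Dict[str, float] = defaultdict(float)
    votes[answer] += confidence
    for _ in range(max_samples - 1):
        extra_answer, extra_confidence = sample(prompt)
        votes[extra_answer] += extra_confidence
    best = max(votes, key=lambda a: votes[a])
    return best, max_samples


def make_scripted_sampler(responses: Iterable[Tuple[str, float]]) -> Sampler:
    """Hypothetical stand-in for a model call; replays fixed responses."""
    iterator = iter(responses)
    return lambda _prompt: next(iterator)
```

In this sketch the savings come entirely from the early exit, which is only safe when stated confidence tracks accuracy. That is why the calibration training and the scaling strategy are presented together.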

Why it matters

AI product managers and engineers can deploy more reliable and cost-effective LLMs by ensuring models accurately express their confidence, leading to better user trust and optimized resource utilization.

How to implement this in your domain

  1. Integrate confidence calibration metrics into the training and evaluation pipelines for LLMs.
  2. Explore using multi-objective reinforcement learning to incentivize both correctness and confidence calibration in model training.
  3. Develop adaptive inference strategies that dynamically adjust computational resources based on an LLM's verbalized confidence.
  4. Prioritize LLMs that demonstrate strong calibration for deployment in high-stakes applications.
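For the first step, one widely used calibration metric is expected calibration error (ECE): bin predictions by stated confidence, then average the gap between mean confidence and accuracy in each bin, weighted by bin size. The summary does not say which metrics the paper reports, so treating ECE as the metric is an assumption, and the function name and the default of 10 equal-width bins are illustrative choices.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed default; equal-width bins over [0, 1]
) -> float:
    """Size-weighted mean of |mean confidence - accuracy| across bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")

    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A model that says 0.9 on four answers and gets three right would score an ECE of about 0.15 here. Tracking that number next to accuracy during training and evaluation shows whether an RL run is trading calibration for correctness.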

Who benefits

AI Development, Software as a Service (SaaS), Customer Service, Finance, Healthcare

Key takeaways

  • LLMs often suffer from poor confidence calibration despite high accuracy.
  • C3RL improves LLM calibration by rewarding correctness, calibration, and reference accuracy.
  • Well-calibrated confidence enables adaptive test-time scaling (CAS) to reduce inference costs.
  • CAS can significantly cut inference budgets while maintaining or improving performance.
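The summary names the three reward ingredients of C3RL but not the formula that combines them. The sketch below is therefore not the paper's reward. It is one hypothetical way to combine a correctness term, a Brier-style calibration term, and a term that pulls confidence toward a reference accuracy, assumed here to be a per-question or per-dataset accuracy estimate. The function name and the weights `w_cal` and `w_ref` are illustrative assumptions.

```python
def calibration_aware_reward(
    is_correct: bool,
    confidence: float,
    reference_accuracy: float,
    w_cal: float = 1.0,  # assumed weight on the calibration term
    w_ref: float = 0.5,  # assumed weight on the reference-accuracy term
) -> float:
    """Illustrative three-part reward; NOT the formula from the paper."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not 0.0 <= reference_accuracy <= 1.0:
        raise ValueError("reference_accuracy must lie in [0, 1]")

    correctness = 1.0 if is_correct else 0.0
    # Brier-style term: highest when confidence matches the outcome.
    calibration = 1.0 - (confidence - correctness) ** 2
    # Assumed reference term: highest when confidence matches the
    # reference accuracy estimate for this kind of question.
    reference = 1.0 - abs(confidence - reference_accuracy)
    return correctness + w_cal * calibration + w_ref * reference
```

Under these assumptions a wrong answer stated with low confidence earns more than the same wrong answer stated with full confidence, which is the behavior the calibration terms are meant to encourage.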

Original post by Xuqing Yang, Yi Yuan, Shanzhe Lei, Xuhong Wang

"arXiv:2607.01612v1 Announce Type: new Abstract: Training large language models (LLMs) with reinforcement learning (RL) has significantly advanced their performance on reasoning and question-answering tasks. However, prevailing RL reward designs typically prioritize response corre…"
