RefGRPO Closes Agent Reflection Gap, Improves RL Performance.

Yinglun Zhu · June 15, 2026

Summary

This paper introduces RefGRPO, a method that enhances LLM agents' ability to accurately assess their own performance after observing environmental feedback, addressing a "reflection gap." It augments standard Reinforcement Learning with a free calibration bonus and a dynamic schedule, improving both reflection calibration and task accuracy without needing external reward models.

Large Language Models increasingly function as agents interacting with environments and receiving feedback, such as execution results or error messages. A critical capability for these agents is to accurately self-assess their performance based on this feedback. However, a "reflection gap" often exists, where LLM agents misjudge their own outputs even after observing concrete outcomes, and standard Reinforcement Learning (RL) struggles to correct this due to credit-assignment issues.

To bridge this gap, the researchers propose RefGRPO, a straightforward yet effective enhancement for standard RL algorithms. RefGRPO incorporates two main components: a "free calibration bonus" that is calculated by comparing the agent's self-reflection with the actual environmental outcome, requiring no additional reward models or external annotations; and a dynamic schedule for this bonus's coefficient.

Compared to conventional RL baselines, RefGRPO simultaneously boosts reflection calibration, significantly reducing underconfidence rates, and improves task accuracy across various text-to-SQL benchmarks. This calibrated reflection transforms the agent into a self-verifier grounded in environmental feedback, enabling better self-improvement without outcome supervision and more effective selective prediction during testing.
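As a rough illustration of the calibration bonus described above, the sketch below assumes the reflection is a binary self-verdict ("my answer is correct" or "incorrect"), the environment outcome is a binary success flag, and the bonus is 1 when the two agree. The summary does not give the paper's exact formula, so the additive form, the function names, and the coefficient `lam` are all assumptions made for illustration. The last function is the usual GRPO-style normalization of rewards within a group of rollouts.

```python
from statistics import mean, pstdev


def calibration_bonus(reflected_correct: bool, env_correct: bool) -> float:
    """Return 1.0 when the agent's self-verdict matches the environment outcome."""
    # Assumption: a binary agreement indicator; the paper may use another form.
    return 1.0 if reflected_correct == env_correct else 0.0


def shaped_reward(env_correct: bool, reflected_correct: bool, lam: float) -> float:
    """Task reward plus lam times the calibration bonus (assumed additive form)."""
    task_reward = 1.0 if env_correct else 0.0
    return task_reward + lam * calibration_bonus(reflected_correct, env_correct)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: (environment says correct, agent reflects "correct")
rollouts = [(True, True), (True, False), (False, False), (False, True)]
rewards = [shaped_reward(env, ref, lam=0.5) for env, ref in rollouts]
advantages = group_advantages(rewards)
```

The second rollout is the underconfident case: the answer is correct but the agent judges it wrong, so it earns the task reward without the bonus. No reward model or annotation is involved, since both inputs to the bonus come from the rollout itself.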

Why it matters

For professionals developing and deploying autonomous AI agents, particularly in domains requiring high reliability and self-correction, RefGRPO offers a practical way to make agents more trustworthy and efficient. Improving an agent's ability to accurately assess its own performance is crucial for robust real-world applications and reducing the need for constant human oversight.

How to implement this in your domain

  1. Integrate RefGRPO's calibration bonus into existing Reinforcement Learning pipelines for agent training.
  2. Develop mechanisms for agents to generate and compare self-reflections with actual environmental outcomes.
  3. Implement dynamic scheduling for calibration coefficients to optimize agent learning and self-assessment.
  4. Utilize calibrated agent reflections as pseudo-rewards for self-improvement or for selective prediction in production systems.
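For the scheduling step, the summary says only that the bonus coefficient follows a dynamic schedule and does not describe its shape. The sketch below uses a linear interpolation as a placeholder; the function name `scheduled_coefficient`, the parameters `lam_start` and `lam_end`, and their default values are hypothetical.

```python
def scheduled_coefficient(step: int, total_steps: int,
                          lam_start: float = 1.0, lam_end: float = 0.1) -> float:
    """Linearly interpolate the bonus coefficient over training.

    Assumption: a linear schedule stands in for the paper's dynamic schedule,
    whose actual form is not given in this summary.
    """
    if total_steps <= 1:
        return lam_end
    # Clamp progress to [0, 1] so out-of-range steps stay within the endpoints.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)


# Coefficient at the start, middle, and end of an 11-step run.
schedule = [scheduled_coefficient(s, 11) for s in (0, 5, 10)]
```

In a training loop, the returned value would be passed as the weight on the calibration bonus at each step, so the balance between task reward and calibration can shift as training progresses.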

Who benefits

AI Development, Robotics, Software Engineering, Customer Service, Autonomous Systems

Key takeaways

  • LLM agents often mis-assess their performance despite environmental feedback.
  • RefGRPO introduces a "free calibration bonus" to close this reflection gap.
  • It improves both reflection calibration and task accuracy in agents.
  • Calibrated reflection enables better self-improvement and selective prediction.
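Selective prediction from the last takeaway can be sketched as follows: the agent answers only when its own reflection says the output is correct, and abstains otherwise. This is a minimal illustration on made-up records, not the paper's evaluation protocol; the function name `selective_prediction` and the sample data are hypothetical.

```python
def selective_prediction(records: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (coverage, selective accuracy) when abstaining on negative reflections.

    Each record is (reflected_correct, env_correct).
    """
    # Keep only the outcomes of cases where the agent judged itself correct.
    answered = [env for reflected, env in records if reflected]
    coverage = len(answered) / len(records) if records else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


# Hypothetical records: (agent reflects "correct", environment says correct)
records = [(True, True), (True, True), (False, False), (True, False), (False, True)]
coverage, accuracy = selective_prediction(records)
```

With these records the agent answers three of five queries and gets two of those three right. The last record is an underconfident abstention, where a correct answer is withheld; better calibration reduces such cases and raises coverage without lowering accuracy.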

Original post by Yinglun Zhu

"arXiv:2606.14211v1 Announce Type: new Abstract: LLMs are increasingly deployed as agents that interact with external environments and observe feedback such as execution results, error messages, and tool outputs. A well-functioning agent should be able to leverage this feedback to…"

Originally posted by Yinglun Zhu on X
