New Metric Improves LLM Reinforcement Learning with Verifiable Rewards.

Outongyi Lv, Yanzhao Zheng, Yuanwei Zhang, Zhenghao Huang, Xingjun Wang, Baohua Dong, Hangcheng Zhu, Yingda Chen · July 1, 2026

Summary

This research introduces the Relative Surprisal Index (RSI), an information-theoretic metric for adaptive token selection in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. RSI-S, an entropy-adaptive filtering method built on RSI, retains only tokens within a stable surprisal interval and improves average reasoning accuracy on AIME and AMC by 2-3 percentage points over baselines such as GRPO.

Reinforcement learning (RL) is increasingly used to push Large Language Models (LLMs) beyond imitation-based training, particularly through RL with Verifiable Rewards (RLVR), which aims to improve reasoning. Prior work, however, gives conflicting advice on which tokens to prioritize during RLVR training: some suggest prioritizing high-entropy tokens, while others caution against low-probability tokens. Both approaches have shown empirical success, which leaves the underlying policy-optimization dynamics unclear.

To resolve this tension, the researchers propose the Relative Surprisal Index (RSI), an information-theoretic metric that combines a token's predictive entropy with the probability of the selected token. They show that RSI relates to the local ratio between the logit-gradient norm and variations in predictive entropy. Building on RSI, they introduce RSI Selection (RSI-S), an entropy-adaptive token-filtering method that removes both redundant low-surprisal tokens and unstable high-surprisal tail tokens, reconciling the two earlier paradigms.

On the AIME and AMC benchmarks, RSI-S consistently improves average accuracy by 2-3 percentage points over existing methods such as GRPO across model scales. The authors present this as a new perspective for improving RLVR.
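
The two quantities that RSI is described as combining, predictive entropy and the probability of the selected token, can be computed directly from a token's logits. The plain-Python sketch below does that. The final function, `surprisal_to_entropy_ratio`, is only an illustrative stand-in assumed for this example; this summary does not state the paper's RSI formula, so the ratio should not be read as the authors' definition. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: H = -sum_i p_i * log(p_i)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def surprisal(probs, token_id):
    """Surprisal of the sampled token in nats: -log p(token)."""
    return -math.log(probs[token_id])

def surprisal_to_entropy_ratio(logits, token_id, eps=1e-8):
    """ASSUMPTION: an illustrative ratio, not the paper's RSI formula."""
    probs = softmax(logits)
    return surprisal(probs, token_id) / (entropy(probs) + eps)
```

Because entropy is the expected surprisal under the model's own distribution, a ratio near 1 marks a token that is about as surprising as the model expected at that position. For a uniform distribution over four tokens, for instance, both quantities equal log 4.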

Why it matters

This research offers a more principled way to choose which tokens drive RLVR updates, with a reported gain of 2-3 percentage points in reasoning accuracy. Practitioners who develop or fine-tune LLMs can use RSI to decide which tokens to train on.

How to implement this in your domain

  1. Integrate the Relative Surprisal Index (RSI) into your LLM fine-tuning pipelines using RLVR.
  2. Implement RSI Selection (RSI-S) to filter tokens during training, focusing on those within a stable surprisal interval.
  3. Experiment with different RSI thresholds to optimize LLM performance for specific reasoning tasks.
  4. Analyze token surprisal and entropy during LLM training to gain deeper insights into model learning dynamics.
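
Steps 2 and 3 above can be sketched as an interval mask applied to a per-token loss. This is a minimal sketch under stated assumptions, not the authors' implementation: `scores` holds one precomputed selection score per token (for example, an RSI value), the fixed bounds `low` and `high` are hypothetical placeholders (the summary describes RSI-S as entropy-adaptive, and that adaptation rule is not reproduced here), and `token_losses` stands in for whatever per-token policy-gradient loss your RLVR trainer already produces.

```python
def interval_mask(scores, low, high):
    """Return 1.0 for tokens whose score lies in [low, high], else 0.0."""
    return [1.0 if low <= s <= high else 0.0 for s in scores]

def masked_mean_loss(token_losses, mask):
    """Average the per-token loss over the kept tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0  # nothing selected: this batch contributes no loss
    return sum(l * m for l, m in zip(token_losses, mask)) / kept

def kept_fraction(scores, low, high):
    """Fraction of tokens retained; useful to log when sweeping thresholds."""
    mask = interval_mask(scores, low, high)
    return sum(mask) / len(mask) if mask else 0.0

# Hypothetical scores: one very low (redundant), two mid-range, one tail value.
scores = [0.05, 0.8, 1.2, 6.0]
token_losses = [0.1, 0.4, 0.6, 3.0]
mask = interval_mask(scores, low=0.5, high=2.0)   # keeps the two middle tokens
loss = masked_mean_loss(token_losses, mask)       # mean of 0.4 and 0.6
```

When experimenting with thresholds, it may help to log `kept_fraction` next to validation accuracy for each `(low, high)` pair, so that an interval that discards nearly all tokens is easy to spot.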

Who benefits

  • AI Development
  • Software Engineering
  • Research & Development
  • EdTech (AI tutors)
  • LegalTech (AI assistants)

Key takeaways

  • The Relative Surprisal Index (RSI) is a new metric for adaptive token selection in RLVR for LLMs.
  • RSI reconciles conflicting views on prioritizing high-entropy vs. low-probability tokens.
  • RSI Selection (RSI-S) filters tokens within a stable RSI interval, improving reasoning accuracy.
  • RSI-S boosts LLM reasoning accuracy by 2-3 percentage points over baselines.

Original post by Outongyi Lv, Yanzhao Zheng, Yuanwei Zhang, Zhenghao Huang, Xingjun Wang, Baohua Dong, Hangcheng Zhu, Yingda Chen

"arXiv:2606.31575v1 Abstract: Reinforcement learning (RL) has become a powerful tool for propelling Large Language Models (LLMs) beyond imitation-based training towards more robust reasoning capabilities. Among existing approaches, RL with Verifiable Rewards (RL…"
