SFT Overtraining Predicts Rank Inversion in RLHF Models

Siddharth Aphale, Kelly Liu · June 18, 2026

Summary

This research shows that overtraining in Supervised Fine-Tuning (SFT) can lead to "rank inversion" in Reinforcement Learning from Human Feedback (RLHF) models, where models with higher initial performance end up performing worse after RL. This phenomenon is linked to an entropy collapse in the SFT rollout distribution, particularly under binary rewards.

In large language model development, the standard practice is to take the Supervised Fine-Tuning (SFT) checkpoint with the highest initial performance (e.g., pass@1) into the subsequent Reinforcement Learning from Human Feedback (RLHF) stage. This study argues that the practice can be counterproductive: overtraining during SFT can produce "rank inversion," where checkpoints that start out stronger end up with worse results after RL.

The mechanism identified is an entropy collapse in the SFT rollout distribution, which matters most under binary rewards. When SFT compresses this distribution too far, the rollouts sampled within a group tend to receive the same reward, so the expected within-group advantage variance shrinks and the RL algorithm is left with little signal to learn from.

Experiments with Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B demonstrated the effect. For Qwen, increasing SFT depth raised pre-RL pass@1 but significantly lowered peak GRPO pass@10, and pre-RL entropy correlated positively with the RL outcome. The authors propose a two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, to identify and stop failing runs early. The summary also suggests that simple regularization techniques may not fully mitigate the issue.
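To make the signal-loss argument concrete, the sketch below evaluates the expression quoted in the abstract, p(1-p)(g-1)/g, and checks it against a direct enumeration. It assumes rewards are independent Bernoulli(p) draws and that the advantage is the reward minus the group mean, with no standard-deviation normalization. The function names, the group size of 8, and the `zero_signal_probability` helper are illustrative additions, not taken from the paper.

```python
from math import comb


def expected_advantage_variance(p: float, g: int) -> float:
    """Closed form quoted in the abstract: p(1-p)(g-1)/g."""
    return p * (1.0 - p) * (g - 1) / g


def enumerated_advantage_variance(p: float, g: int) -> float:
    """Same quantity, computed by enumerating how many of g rollouts succeed.

    Assumption: rewards are i.i.d. Bernoulli(p). For a group with k successes,
    the within-group (population) variance of the rewards is m(1-m), m = k/g.
    """
    total = 0.0
    for k in range(g + 1):
        weight = comb(g, k) * p**k * (1.0 - p) ** (g - k)  # binomial probability
        mean_reward = k / g
        total += weight * mean_reward * (1.0 - mean_reward)
    return total


def zero_signal_probability(p: float, g: int) -> float:
    """Probability that all g rollouts share one reward, so every advantage is 0."""
    return p**g + (1.0 - p) ** g


if __name__ == "__main__":
    g = 8  # hypothetical group size, chosen only for illustration
    for p in (0.5, 0.8, 0.95, 0.99):
        closed = expected_advantage_variance(p, g)
        enumerated = enumerated_advantage_variance(p, g)
        dead = zero_signal_probability(p, g)
        print(f"p={p:.2f} closed={closed:.5f} enumerated={enumerated:.5f} "
              f"zero-signal groups={dead:.3f}")
```

Under these assumptions the variance peaks at p = 0.5 and falls toward zero as the per-prompt success rate approaches 0 or 1, which is one way to read why a sharply peaked rollout distribution leaves a group-relative method with little to learn from.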

Why it matters

For AI engineers and researchers working on large language models and RLHF, understanding the pitfalls of SFT overtraining is critical for optimizing model development. This insight can prevent wasted computational resources and lead to more effective and robust LLMs.

How to implement this in your domain

  1. Implement a two-stage diagnostic protocol for SFT checkpoints, combining pre-RL entropy triage with an early RL entropy monitor.
  2. Avoid selecting SFT checkpoints solely based on peak pass@1 scores, especially when using binary rewards for RLHF.
  3. Monitor the entropy of the SFT rollout distribution to detect potential collapse before proceeding to RL.
  4. Experiment with different SFT training durations and regularization techniques to prevent overtraining and entropy collapse.
  5. Adjust RLHF training strategies to account for potential rank inversion, focusing on maintaining sufficient signal for the RL algorithm.
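
One way the two-stage protocol in the steps above could be wired together is sketched here. This is not the authors' implementation: the function names, the entropy threshold, the window length, and the floor ratio are all hypothetical placeholders that would need to be calibrated on your own runs. The sketch assumes per-token probability distributions from sampled rollouts are already available.

```python
import math
from typing import Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_rollout_entropy(rollouts: List[List[Sequence[float]]]) -> float:
    """Average token entropy over all positions of all sampled rollouts."""
    entropies = [token_entropy(dist) for rollout in rollouts for dist in rollout]
    return sum(entropies) / len(entropies)


def triage_checkpoints(entropy_by_ckpt: Dict[str, float],
                       min_entropy: float) -> List[str]:
    """Stage 1 (pre-RL): keep checkpoints whose rollout entropy clears a floor.

    `min_entropy` is a hypothetical threshold, not a value from the paper.
    """
    return [name for name, h in entropy_by_ckpt.items() if h >= min_entropy]


def should_stop_early(entropy_trace: Sequence[float],
                      window: int = 3,
                      floor_ratio: float = 0.5) -> bool:
    """Stage 2 (early RL): flag a run whose entropy has collapsed.

    Flags the run when each of the last `window` logged entropies is below
    `floor_ratio` times the first logged value. Both defaults are
    illustrative assumptions.
    """
    if len(entropy_trace) <= window:
        return False  # not enough history to judge
    baseline = entropy_trace[0]
    return all(h < floor_ratio * baseline for h in entropy_trace[-window:])


if __name__ == "__main__":
    # Hypothetical pre-RL entropies for three SFT checkpoints.
    candidates = {"sft-epoch1": 0.92, "sft-epoch3": 0.61, "sft-epoch6": 0.18}
    print(triage_checkpoints(candidates, min_entropy=0.5))
    # Hypothetical entropy trace logged during the first RL steps.
    print(should_stop_early([1.0, 0.9, 0.4, 0.3, 0.2]))
```

Keeping the two stages as separate functions means the triage can run before any RL compute is spent, while the monitor only needs the entropy values the RL trainer already logs.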

Who benefits

AI Engineering · Natural Language Processing · Software Development · Machine Learning Research

Key takeaways

  • SFT overtraining can lead to "rank inversion," where SFT checkpoints with higher pre-RL pass@1 perform worse after RLHF.
  • This is caused by an entropy collapse in the SFT rollout distribution, especially with binary rewards.
  • A two-stage diagnostic using pre-RL and early RL entropy monitoring can flag failing runs early so they can be stopped.
  • Selecting SFT checkpoints based solely on pass@1 can be misleading for subsequent RLHF.

Original post by Siddharth Aphale, Kelly Liu

"arXiv:2606.18487v1 Announce Type: new Abstract: The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within group advantage variance is $p(1{-}p)(g{-}1)/g$;…"

Originally posted by Siddharth Aphale, Kelly Liu on X
