Offline RL Losses Show Distinct Weight-Space Geometries and Performance

Aleksandr Nikolich, Igor Kiselev, Vladimir Platonov, Karina Romanova · June 24, 2026

Summary

A study comparing six offline reinforcement learning losses for distilling reasoning into smaller models finds that they differ in weight-space geometry as well as in accuracy. DPO stands out: it occupies a near-orthogonal subspace, shows a mode-connectivity barrier, and reaches significantly higher accuracy on reasoning tasks.

Offline reinforcement learning (RL) losses are widely used to distill reasoning capabilities from large teacher models into smaller student models. These methods are typically compared on downstream accuracy alone. This research asks whether they are mechanistically distinct or whether they converge to similar weight updates in the network's parameter space.

The study trained six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical mathematical reasoning rollouts from a Qwen3-4B base model, then compared the resulting weight deltas using cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA (centered kernel alignment). SFT, RFT, and RIFT produced nearly collinear weight deltas and comparable accuracy on GSM8K. DFT diverged more in direction, while Offline GRPO added a substantial orthogonal component to the SFT direction.

DPO (Direct Preference Optimization) was the outlier: it occupied a near-orthogonal subspace, exhibited a mode-connectivity barrier, and showed a collapse in late-layer CKA. Despite using a 10x smaller learning rate (a standard convention), DPO achieved significantly higher accuracy than the other methods on both GSM8K (93.5%) and AIME26 (30.0%). This suggests that DPO's distinct weight-space geometry is linked to its stronger performance in distilling reasoning, and that loss function and optimizer choices jointly shape the update dynamics and the final model's capabilities.
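The weight-delta comparisons described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the helper names (`flatten_delta`, `cosine`, `split_along`, `top_subspace`, `principal_angles`) are hypothetical, checkpoints are assumed to be plain dicts of arrays, and taking the top-k left singular vectors of a layer's delta as "the subspace" is an assumption about how such an analysis might be set up.

```python
import numpy as np

def flatten_delta(base, tuned):
    """Concatenate (tuned - base) over all parameter tensors into one vector."""
    return np.concatenate([(tuned[k] - base[k]).ravel() for k in sorted(base)])

def cosine(u, v):
    """Cosine similarity between two flattened weight deltas."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def split_along(delta, reference):
    """Split delta into parts parallel and orthogonal to a reference direction."""
    unit = reference / np.linalg.norm(reference)
    parallel = (delta @ unit) * unit
    return parallel, delta - parallel

def top_subspace(delta_matrix, k):
    """Orthonormal basis for the top-k left singular directions of one layer's delta."""
    U, _, _ = np.linalg.svd(delta_matrix, full_matrices=False)
    return U[:, :k]

def principal_angles(A, B):
    """Principal angles (radians) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Toy example with made-up numbers (not the paper's weights).
base = {"layer.weight": np.zeros((3, 2)), "layer.bias": np.zeros(2)}
run_a = {"layer.weight": np.ones((3, 2)), "layer.bias": np.ones(2)}
d_a = flatten_delta(base, run_a)
d_b = 2.0 * d_a          # same direction, different scale
d_b[0] += 1.0            # plus a small off-axis perturbation
parallel, orthogonal = split_along(d_b, d_a)
print("cosine:", cosine(d_a, d_b))
print("orthogonal share of norm:", np.linalg.norm(orthogonal) / np.linalg.norm(d_b))
```

A cosine near 1 corresponds to the "nearly collinear" case, a large orthogonal share to the "substantial orthogonal component" case, and principal angles near π/2 to the "near-orthogonal subspace" case.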

Why it matters

For AI engineers and researchers working on model distillation and efficient reasoning, understanding the mechanistic differences between offline RL losses is crucial. DPO's superior performance and distinct weight-space behavior offer valuable insights for developing more effective and efficient smaller models capable of complex reasoning.

How to implement this in your domain

  1. Evaluate DPO as a primary method for distilling reasoning capabilities into smaller language models.
  2. Investigate the impact of learning rate schedules and optimizer choices when applying offline RL losses.
  3. Utilize weight-space analysis techniques (e.g., cosine similarity, CKA) to understand the mechanistic differences between training methods.
  4. Consider the implications of mode connectivity and subspace orthogonality when selecting and fine-tuning distillation strategies.
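For steps 3 and 4, a linear mode connectivity check and a linear CKA score could look roughly like the sketch below. It assumes checkpoints are dicts of NumPy arrays, that `loss_fn` is a user-supplied function returning a scalar evaluation loss for a parameter dict, and that activations are matrices of shape (examples, features). The function names are hypothetical, and the barrier definition (maximum excess over the straight line between the endpoint losses) is one common convention, not necessarily the one used in the study.

```python
import numpy as np

def interpolate(theta_a, theta_b, alpha):
    """Parameters at position alpha on the straight line from theta_a to theta_b."""
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def lmc_barrier(theta_a, theta_b, loss_fn, steps=11):
    """Largest amount by which the loss on the path exceeds the
    linear interpolation of the two endpoint losses."""
    alphas = np.linspace(0.0, 1.0, steps)
    losses = np.array([loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

def linear_cka(X, Y):
    """Linear CKA between two activation matrices with the same number of rows."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (norm_x * norm_y))

# Toy double-well loss: two minima at w = -1 and w = +1 with a bump between them.
def toy_loss(theta):
    return float(np.sum((theta["w"] ** 2 - 1.0) ** 2))

theta_a = {"w": np.array([-1.0])}
theta_b = {"w": np.array([1.0])}
print("barrier:", lmc_barrier(theta_a, theta_b, toy_loss))

# Toy activations: CKA of a representation with a rotated, rescaled copy of itself.
rng = np.random.default_rng(0)
acts = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
print("CKA:", linear_cka(acts, 3.0 * acts @ rotation))
```

A barrier near zero means the two checkpoints sit in the same loss basin along the straight path; a clearly positive barrier is the kind of signal the summary describes for DPO. For a real model, `loss_fn` would load the interpolated parameters and evaluate on held-out data.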

Who benefits

AI Engineering, Machine Learning Research, Natural Language Processing, Software Development

Key takeaways

  • Offline RL losses exhibit distinct weight-space geometries during reasoning distillation.
  • SFT, RFT, and RIFT produce nearly collinear weight deltas and similar accuracies.
  • DPO occupies a near-orthogonal subspace and achieves significantly higher accuracy on reasoning tasks.
  • Loss function and optimizer choices jointly determine update dynamics and model capabilities.
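For reference, the standard textbook forms of the SFT and DPO objectives are written out below. This is the usual formulation of each loss, not a statement of the exact variants or hyperparameters used in the study; the symbols (policy \pi_\theta, frozen reference \pi_{\mathrm{ref}}, temperature \beta, preferred and rejected responses y_w and y_l) follow the common notation and do not appear in the summary itself.

```latex
% Supervised fine-tuning: maximize likelihood of the target response y.
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x,\,y)}\bigl[\log \pi_\theta(y \mid x)\bigr]

% DPO: raise the preferred response y_w relative to the rejected response y_l,
% both measured against a frozen reference policy.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Unlike the SFT objective, the DPO objective contains a term that lowers the likelihood of rejected responses. That is one plausible reason its updates could point in a different direction, though the summary does not attribute the geometry to any single term.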

Original post by Aleksandr Nikolich, Igor Kiselev, Vladimir Platonov, Karina Romanova

"arXiv:2606.23740v1 Announce Type: new Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they a…"

Originally posted by Aleksandr Nikolich, Igor Kiselev, Vladimir Platonov, Karina Romanova on X
