ExTra Improves Language Model Reasoning with Exploratory Trajectory Optimization

Wenyang Hu, Junxiang Jia, Zhen Shu, Daniel Dahlmeier, See-Kiong Ng, Bryan Kian Hsiang Low · June 25, 2026

Summary

This paper introduces ExTra, a framework for Reinforcement Learning with Verifiable Rewards (RLVR) that enhances language model reasoning by extracting exploration signals from model rollouts. It uses novelty rewards for diversity and entropy-guided prefix regeneration to improve accuracy on mathematical reasoning benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) for language models often struggles at both extremes of task difficulty. When prompts are too easy, rollout groups tend to be all correct and low in diversity, so they carry little gradient signal; prompts that are too hard also yield a weak learning signal. To address this, the authors propose ExTra (Exploratory Trajectory Optimization), a GRPO-compatible framework that aims to improve exploration during training.

ExTra incorporates two main mechanisms. First, a novelty reward adds embedding-based diversity bonuses, encouraging the model to find varied correct solutions. Second, entropy-guided prefix regeneration evaluates partial trajectories using entropy signals and continues exploration from promising intermediate steps.

Across six mathematical reasoning benchmarks, ExTra improved Qwen3-1.7B over GRPO by approximately 5 points on pass@1 and 7 points on pass@16. These results suggest that trajectory-level exploration signals can improve both single-sample accuracy and inference-time coverage.
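To make the "little gradient signal" problem and the novelty reward concrete, here is a minimal sketch in plain Python. It assumes the standard GRPO group normalization (reward minus group mean, divided by group standard deviation) and a hypothetical bonus of the form `weight * (1 - mean cosine similarity to the other correct rollouts)`. The function names, the `weight` parameter, and the bonus formula are illustrative assumptions, not the paper's actual definitions.

```python
import math

def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def novelty_shaped_rewards(correct, embeddings, weight=0.5):
    """Base reward 1/0 for correct/incorrect, plus a diversity bonus.

    Assumed (hypothetical) bonus, applied to correct rollouts only:
    weight * (1 - mean cosine similarity to the other correct rollouts).
    """
    idx = [i for i, c in enumerate(correct) if c]
    shaped = [1.0 if c else 0.0 for c in correct]
    for i in idx:
        others = [j for j in idx if j != i]
        if not others:
            continue  # a lone correct rollout has nothing to be compared with
        mean_sim = sum(cosine(embeddings[i], embeddings[j]) for j in others) / len(others)
        shaped[i] += weight * (1.0 - mean_sim)
    return shaped

# An "easy" prompt: all four rollouts are correct, three are near-duplicates.
correct = [True, True, True, True]
embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

plain = group_advantages([1.0, 1.0, 1.0, 1.0])   # every advantage is zero
shaped = group_advantages(novelty_shaped_rewards(correct, embeddings))
```

With identical rewards every advantage is zero, so the group contributes no update. Under the assumed bonus, the one rollout whose embedding differs from the rest receives a positive advantage and the near-duplicates receive negative ones, which restores a learning signal on an otherwise saturated prompt.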

Why it matters

Professionals working with large language models can leverage this research to develop more robust and accurate AI systems, especially for complex reasoning tasks where current RL methods fall short. It offers a pathway to improve model performance and reliability in critical applications.

How to implement this in your domain

  1. Investigate integrating ExTra's novelty reward mechanism into existing RL fine-tuning pipelines (for example, GRPO-based RLVR) to encourage diverse and correct model outputs.
  2. Experiment with entropy-guided prefix regeneration to guide exploration in challenging language model tasks, focusing on promising intermediate steps.
  3. Evaluate the ExTra framework on custom language model applications, particularly those requiring complex reasoning or problem-solving.
  4. Adapt the principles of trajectory-level exploration to fine-tune models for specific domain expertise, aiming for higher accuracy and broader solution coverage.
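As a starting point for experimenting with entropy-guided prefix regeneration, the sketch below shows one possible way to choose where to resume sampling. It assumes access to the per-step next-token probability distributions of a rollout and treats the highest-entropy step as the branch point. That selection rule, the `min_prefix` parameter, and all function names are assumptions made for illustration; the summary does not specify how ExTra scores partial trajectories.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_branch_point(step_distributions, min_prefix=1):
    """Index of the highest-entropy step at or after min_prefix.

    Assumption: high entropy marks a step where the model was uncertain,
    which makes it a candidate place to branch and explore again.
    """
    if len(step_distributions) <= min_prefix:
        raise ValueError("trajectory is too short to branch")
    entropies = [token_entropy(p) for p in step_distributions]
    return max(range(min_prefix, len(entropies)), key=lambda t: entropies[t])

def prefix_for_regeneration(tokens, step_distributions, min_prefix=1):
    """Keep the tokens before the branch point; sampling resumes from there."""
    t = pick_branch_point(step_distributions, min_prefix)
    return tokens[:t]

# Toy rollout: step i's distribution is the one that token i was sampled from.
tokens = ["x", "=", "2", "+", "3"]
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.85, 0.05, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],  # uniform: the model was least certain here
    [0.70, 0.10, 0.10, 0.10],
]
prefix = prefix_for_regeneration(tokens, dists)
```

In this toy case the fourth step is uniform over its candidates, so the first three tokens are kept and a new continuation would be sampled from that point. In a real pipeline the retained prefix would be passed back to the policy model to generate fresh completions, which are then scored by the verifier.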

Who benefits

AI Development, Software Engineering, Education, Research & Development

Key takeaways

  • ExTra improves language model reasoning by addressing limitations in RLVR for tasks of varying difficulty.
  • The framework uses novelty rewards and entropy-guided prefix regeneration for better exploration.
  • It significantly boosts accuracy on mathematical reasoning benchmarks for models like Qwen3-1.7B.
  • Trajectory-level exploration signals are crucial for enhancing both single-sample accuracy and inference-time coverage.
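For readers unfamiliar with the pass@1 and pass@16 metrics quoted above, the sketch below implements the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that are correct. Whether the paper computes its numbers with this exact estimator is an assumption; the function name and arguments are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# 16 samples for one problem, 4 of them correct.
single_sample = pass_at_k(16, 4, 1)   # equals c / n
coverage = pass_at_k(16, 4, 16)       # any correct sample guarantees a hit
```

pass@1 reduces to the fraction of correct samples, which is why it is read as single-sample accuracy, while pass@16 measures whether any of sixteen attempts succeeds, which is why it is read as coverage. Benchmark scores average this quantity over all problems.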

Original post by Wenyang Hu, Junxiang Jia, Zhen Shu, Daniel Dahlmeier, See-Kiong Ng, Bryan Kian Hsiang Low

"arXiv:2606.24994v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) for language-model reasoning can fail at both extremes of task difficulty: easy prompts often produce all-correct, low-diversity rollout groups with little gradient signal, while…"
