Strategy-Guided Optimization Improves LLM Reasoning Beyond Imitation.

Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang · June 24, 2026

Summary

This paper introduces Strategy-Guided Policy Optimization (SGPO), a framework that distills reusable problem-solving strategies from strong LLMs to weaker ones, rather than just imitating specific solution trajectories. SGPO uses a token-level forward-KL objective and adaptive instance weighting to improve generalization and consistently outperforms baseline methods on mathematical benchmarks.

Conventional distillation has the weaker model imitate a stronger model's solution trajectories, which tends to produce memorization of specific steps and transfers "what to answer" rather than "how to reason". SGPO instead extracts structured strategy descriptions and, for each problem, constructs two trajectories, one autonomous and one strategy-guided, so that the two can be compared directly.

A token-level forward-KL objective then selectively transfers the distributional shift induced by strategy conditioning into the unguided policy, with proximal constraints keeping training stable. Adaptive instance-level weighting adjusts the amount of guidance as the model's competence evolves. In experiments on four mathematical benchmarks, SGPO consistently surpasses existing methods such as SFT and on-policy RL and improves average scores, which supports strategy distillation over direct trajectory imitation.
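As a rough illustration of the token-level forward-KL idea, the sketch below computes KL(p_guided || p_unguided) at each token position and averages it over a sequence, scaled by a per-instance weight. This is not the paper's implementation: the function names, the mean reduction, and the way the weight enters the loss are assumptions made for illustration, and the proximal constraint is left out entirely.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_forward_kl(guided_logits, unguided_logits):
    """Forward KL(p_guided || p_unguided) at a single token position.

    'guided' is the distribution produced when the prompt includes a strategy;
    'unguided' is the distribution for the same position without it.
    """
    if len(guided_logits) != len(unguided_logits):
        raise ValueError("both distributions must cover the same vocabulary")
    log_p = log_softmax(guided_logits)
    log_q = log_softmax(unguided_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sequence_forward_kl(guided_seq, unguided_seq, instance_weight=1.0):
    """Mean token-level forward KL over a sequence, times a per-instance weight.

    The mean reduction and the multiplicative weight are illustrative
    assumptions, not details confirmed by the source.
    """
    if len(guided_seq) != len(unguided_seq) or not guided_seq:
        raise ValueError("sequences must be non-empty and of equal length")
    per_token = [token_forward_kl(g, u) for g, u in zip(guided_seq, unguided_seq)]
    return instance_weight * sum(per_token) / len(per_token)
```

In a real training loop the guided logits would be treated as a fixed target and only the unguided policy would receive gradients; an autodiff framework would replace the plain-Python arithmetic used here.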

Why it matters

For AI engineers and researchers, SGPO offers a more effective method for training LLMs to reason, leading to models that are more adaptable, generalize better to novel problems, and require less fine-tuning for specific tasks.

How to implement this in your domain

  1. Explore methods for extracting and formalizing problem-solving strategies from expert demonstrations or strong LLMs.
  2. Implement strategy-guided policy optimization techniques in your LLM training pipelines.
  3. Experiment with forward-KL objectives and adaptive weighting schemes for more efficient knowledge distillation.
  4. Apply SGPO principles to improve the reasoning capabilities of LLMs in complex problem-solving domains.
  5. Develop benchmarks that specifically test for generalization of reasoning strategies rather than just task performance.
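A first experiment along the lines of the steps above might look like the sketch below, which builds the paired autonomous and strategy-guided prompts for a problem and derives a per-instance guidance weight from how often unguided rollouts succeed. Everything here is hypothetical: the prompt templates, the function names `build_prompt_pair` and `guidance_weight`, and the "one minus pass rate" rule are assumptions chosen for simplicity, since the source does not spell out the paper's actual weighting formula.

```python
def build_prompt_pair(problem, strategy):
    """Return (autonomous_prompt, guided_prompt) for one problem.

    The templates are placeholders; only the guided prompt sees the strategy.
    """
    autonomous = f"Problem: {problem}\nSolve step by step."
    guided = (
        f"Strategy: {strategy}\n"
        f"Problem: {problem}\n"
        "Solve step by step, following the strategy."
    )
    return autonomous, guided


def guidance_weight(unguided_correct, floor=0.0):
    """Hypothetical adaptive weight: more guidance where the model still fails.

    unguided_correct is a list of booleans, one per unguided rollout, marking
    whether that rollout reached the right answer.
    """
    if not unguided_correct:
        raise ValueError("need at least one unguided rollout")
    pass_rate = sum(1 for ok in unguided_correct if ok) / len(unguided_correct)
    return max(floor, 1.0 - pass_rate)
```

Under this rule a problem the model already solves reliably without guidance contributes little or nothing to the distillation term, while a problem it usually fails gets close to full weight. Other schedules would be equally plausible starting points.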

Who benefits

AI Engineering, Education, Software Development, Robotics, Research & Development

Key takeaways

  • SGPO distills reusable problem-solving strategies to improve LLM reasoning.
  • It moves beyond trajectory imitation to enhance generalization to novel problems.
  • A token-level forward-KL objective and adaptive weighting are key components.
  • SGPO consistently outperforms baselines on mathematical reasoning benchmarks.

Original post by Tianyuan Shi, Canbin Huang, Bei Li, Xin Chen, Xiaojun Quan, Jingang Wang, Qifan Wang

"arXiv:2606.24064v1 Announce Type: new Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather than how to reason. This trajectory-level imitation en…"
