Rubric-Conditioned Self-Distillation Enhances LLM Reasoning

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying · June 18, 2026

Summary

Researchers propose Rubric-Conditioned Self-Distillation, a novel framework that uses structured, fine-grained rubrics to guide the post-training of reasoning language models. This method provides token-level guidance, offering more detailed feedback than scalar rewards and outperforming existing distillation and reinforcement learning techniques on science reasoning benchmarks.

Post-training of reasoning language models typically relies on supervised distillation or reinforcement learning, and each has a limitation. Supervised distillation depends on chain-of-thought annotations that are expensive to obtain and can be noisy: a rationale may be flawed even when its final answer is correct. Reinforcement learning typically compresses feedback into a single scalar reward, which hides which parts of the reasoning need improvement.

Rubric-Conditioned Self-Distillation addresses both problems by using rubrics as structured, fine-grained feedback for on-policy self-distillation. The teacher model is conditioned on criterion-level rubrics and provides token-level guidance on the student model's sampled trajectories. Instead of relying on a single reference rationale, the rubric specifies the criteria a strong response should meet, which enables more precise credit assignment during the reasoning process. The framework is instantiated as a two-stage pipeline, rubric generation followed by rubric-guided reasoning, and on science reasoning benchmarks it surpassed GRPO by 1.0 points and OPSD by 0.9 points on average.
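To make the contrast between a scalar reward and token-level guidance concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the loss, so the per-token KL divergence, its direction (teacher relative to student), the function names, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_kl(student_logits, teacher_logits):
    """Return KL(teacher || student) at each position of a sampled trajectory.

    ASSUMPTION: the KL direction and the use of KL at all are illustrative;
    the source only says the teacher gives token-level guidance.
    """
    per_token = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softmax(s_row)
        p_t = softmax(t_row)
        kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s))
        per_token.append(kl)
    return per_token


# Toy trajectory: 3 positions, vocabulary of 3 tokens (hypothetical values).
# The student scores its own sampled trajectory given only the question.
student_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
# The same model scores the same trajectory, but conditioned on the rubric.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

per_token = token_level_kl(student_logits, teacher_logits)

# A scalar reward would say only "this trajectory was wrong" (one number).
scalar_reward = 0.0

# The per-token signal instead localizes the disagreement to one position.
worst_position = max(range(len(per_token)), key=lambda i: per_token[i])
print("per-token signal:", [round(v, 3) for v in per_token])
print("position needing most correction:", worst_position)
```

In this toy case the teacher and student agree everywhere except the middle position, so the per-token signal is concentrated there, whereas the scalar reward carries no information about where the reasoning went wrong.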

Why it matters

This approach gives reasoning models a finer-grained training signal than a scalar reward, and in the reported experiments it improved accuracy on science reasoning benchmarks. Practitioners who build AI agents for complex problem-solving and decision-making tasks could apply the same idea wherever the qualities of a strong answer can be written down as explicit criteria.

How to implement this in your domain

  1. Adopt rubric-conditioned self-distillation for fine-tuning LLMs in critical reasoning applications.
  2. Develop detailed rubrics for evaluating and guiding AI model outputs in specific domains.
  3. Integrate fine-grained feedback mechanisms into AI training pipelines to enhance model learning.
  4. Apply this framework to improve the accuracy and explainability of AI-driven decision support systems.
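As a starting point for the second step, a criterion-level rubric can be represented as plain data and rendered into the context the teacher sees. The sketch below is a hypothetical illustration: the `Criterion` class, the `build_teacher_prompt` helper, the prompt wording, and the example physics rubric are all assumptions, since the summary does not describe the rubric format used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item: a short name plus what a strong response must do."""
    name: str
    description: str


def build_teacher_prompt(question, rubric):
    """Render a question and its rubric into a single teacher context string.

    ASSUMPTION: this prompt layout is illustrative only.
    """
    lines = [
        f"Question: {question}",
        "",
        "A strong response satisfies these criteria:",
    ]
    for i, criterion in enumerate(rubric, start=1):
        lines.append(f"{i}. {criterion.name}: {criterion.description}")
    return "\n".join(lines)


# Hypothetical rubric for a science reasoning question.
rubric = [
    Criterion("Identify principle", "States that momentum is conserved in the collision."),
    Criterion("Set up equation", "Writes the momentum balance before and after impact."),
    Criterion("Check units", "Reports the final velocity in metres per second."),
]

prompt = build_teacher_prompt(
    "Two carts collide and stick together. What is their final velocity?",
    rubric,
)
print(prompt)
```

Keeping each criterion as a separate record makes it easy to review, version, and reuse rubrics across questions in the same domain.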

Who benefits

AI Research, Education, Healthcare, Legal, Software Development

Key takeaways

  • Rubric-Conditioned Self-Distillation uses structured rubrics for fine-grained LLM feedback.
  • It provides token-level guidance, overcoming limitations of scalar rewards and noisy annotations.
  • The framework outperforms existing methods on science reasoning benchmarks.
  • This approach enhances the accuracy and reliability of reasoning language models.

Original post by Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

"arXiv:2606.19327v1 Announce Type: new Abstract: Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and…"

Originally posted by Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying on X
