NebulaExp-8B: Transparent Post-Training Pipeline for LLM Alignment.

Qiaobo Hao, Yangqian Wu, Shunyi Wang, Zhongjian Zhang, Ziqun Li, Yayin He, Muqing Li, Chen Zhong · June 26, 2026


Summary

NebulaExp-8B presents a fully transparent, ablation-driven post-training pipeline for 8B-scale LLMs, built on Qwen3-8B-base, and details its data construction, filtering, and training recipes. It covers both a general instruct model and a complex reasoning model, and reports benchmark gains on each from optimized supervised fine-tuning and reinforcement learning.

Post-training alignment is critical for equipping large language models (LLMs) with strong reasoning abilities and the capacity to follow human preferences. However, many existing works omit details of data construction, filtering rules, and training recipes, which hinders reproducibility and efficient optimization for the broader community.

NebulaExp-8B addresses this transparency gap with a comprehensive, ablation-driven post-training pipeline built on the Qwen3-8B-base model. It details the curation of a multi-source SFT corpus of 3.84M samples and a pool of 200K verifiable RL candidates, alongside an end-to-end data processing stack that includes response distillation, multi-dimensional cross-verification filtering, fine-grained difficulty grading, task classification, and diversity-aware sampling.

The pipeline develops two orthogonal model branches: a general instruct model and a complex reasoning-specialized model. For the Instruct branch, a three-stage optimized supervised fine-tuning (SFT) approach, NebulaExp-Ins-SFT, improved benchmark scores from 55.01 to 60.99, and GRPO reinforcement learning raised them further to 61.85. For the Reasoning branch, medium-difficulty GRPO RL boosted scores from 73.88 to 75.17.

The research also systematically investigates single-teacher OPD and multi-teacher OPD (MOPD), showing that MOPD, with only 10K samples, can outperform RL baselines by fusing domain-specialist teachers. This work provides a fully reproducible recipe and dissects capability trade-offs across instruction adherence, mathematical reasoning, code generation, and general knowledge.
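To make the GRPO step more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built around: several responses are sampled for one prompt, and each reward is normalized against the mean and standard deviation of its group. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions; the summary does not describe NebulaExp-8B's actual implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: eps and the population std are assumptions,
    not details taken from the NebulaExp-8B paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a verifier (1.0 = correct).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt the model always solves: every advantage collapses to zero.
solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every response in a group gets the same reward, all advantages are zero and the prompt contributes no gradient signal. That is one plausible reason, not one confirmed by the summary, why training on medium-difficulty prompts can help.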

Why it matters

For AI engineers and researchers, NebulaExp-8B provides a transparent blueprint for post-training LLMs. It supports reproducibility and lightweight model optimization, and it shows how different alignment techniques affect model capabilities, which can speed up the development of more capable and reliable LLMs.

How to implement this in your domain

  1. Adopt NebulaExp-8B's transparent data curation and processing methodologies for custom LLM alignment projects.
  2. Experiment with the three-stage optimized supervised fine-tuning approach for instruction-following models.
  3. Investigate the effectiveness of GRPO reinforcement learning for enhancing both general instruction and complex reasoning capabilities.
  4. Explore the multi-teacher OPD (MOPD) approach for efficient alignment with limited data, especially for specialized domains.
  5. Utilize the detailed ablation research insights to make informed decisions on capability trade-offs during LLM development.
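As a starting point for the data curation step, the difficulty grading and diversity-aware sampling named in the summary might be sketched as below. Everything specific here is an assumption for illustration: grading by a model's pass rate over several attempts, the 0.2 and 0.8 thresholds, the `task` and `pass_rate` field names, and the fixed per-task quota. The paper's own rules may differ.

```python
import random
from collections import defaultdict


def grade_difficulty(pass_rate, easy_above=0.8, hard_below=0.2):
    """Bucket a prompt by pass rate (thresholds are assumed values)."""
    if pass_rate >= easy_above:
        return "easy"
    if pass_rate <= hard_below:
        return "hard"
    return "medium"


def diversity_aware_sample(items, per_task, seed=0):
    """Draw at most per_task items from each task class."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for item in items:
        by_task[item["task"]].append(item)
    picked = []
    for task in sorted(by_task):
        pool = by_task[task]
        picked.extend(rng.sample(pool, min(per_task, len(pool))))
    return picked


# Toy candidate pool; the field names are hypothetical.
pool = [
    {"id": 1, "task": "math", "pass_rate": 0.5},
    {"id": 2, "task": "math", "pass_rate": 0.95},
    {"id": 3, "task": "math", "pass_rate": 0.4},
    {"id": 4, "task": "math", "pass_rate": 0.6},
    {"id": 5, "task": "code", "pass_rate": 0.3},
    {"id": 6, "task": "code", "pass_rate": 0.0},
]

# Keep only medium-difficulty prompts, then cap each task class at two.
medium = [x for x in pool if grade_difficulty(x["pass_rate"]) == "medium"]
selected = diversity_aware_sample(medium, per_task=2)
```

Capping each task class keeps a large category such as math from crowding out smaller ones, which is the basic idea behind sampling for diversity rather than drawing uniformly from the whole pool.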

Who benefits

AI/ML Development, Software Engineering, Research & Academia, Cloud AI Services, EdTech

Key takeaways

  • Transparency in LLM post-training is crucial for reproducibility and optimization.
  • NebulaExp-8B provides a detailed, ablation-driven pipeline for 8B-scale LLMs.
  • Instruct-branch scores rise from 55.01 to 60.99 with three-stage SFT and to 61.85 with GRPO; Reasoning-branch scores rise from 73.88 to 75.17 with medium-difficulty GRPO.
  • Multi-teacher OPD (MOPD) can outperform RL baselines with only 10K samples by fusing domain-specialist teachers.
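The summary does not define OPD or explain how the teachers are fused. Assuming OPD means on-policy distillation (the student generates, a teacher scores the student's tokens), and assuming each sample is simply routed to the teacher for its domain, a per-token loss could be sketched as a reverse KL divergence. The `TEACHERS` table, the three-token vocabulary, and the routing rule are all hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


# Hypothetical next-token distributions from two domain-specialist teachers
# over a toy three-token vocabulary.
TEACHERS = {
    "math": [0.7, 0.2, 0.1],
    "code": [0.1, 0.1, 0.8],
}


def mopd_token_loss(domain, student_probs):
    """Score the student against the teacher assigned to this domain (assumed routing)."""
    return reverse_kl(student_probs, TEACHERS[domain])


student = [0.7, 0.2, 0.1]
loss_math = mopd_token_loss("math", student)  # student already matches this teacher
loss_code = mopd_token_loss("code", student)  # student disagrees with this teacher
```

Routing by domain means the student is only pulled toward the teacher that specializes in the sample at hand. This is one simple way to read "fusing domain-specialist teachers", not a description of the paper's method.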

Original post by Qiaobo Hao, Yangqian Wu, Shunyi Wang, Zhongjian Zhang, Ziqun Li, Yayin He, Muqing Li, Chen Zhong

"arXiv:2606.26671v1 Announce Type: new Abstract: Post-training alignment determines the reasoning and human preference following capabilities of large language models, yet most existing works withhold detailed data construction, filtering rules and training recipes, which hinders…"

