New Pruning Method Boosts Sparse MoE LLM Performance.

Yongqin Zeng, Sicheng Pan, Jiale Wang, Hai-tao Zheng, Hong-Gee Kim, Chunxia Ma, XiuTeng Zhou · July 3, 2026

Summary

This paper introduces Generic TB-Coverage, a coverage-aware expert pruning method for Sparse Mixture-of-Experts (MoE) language models that uses only generic text corpora for calibration. It preserves the high-utility experts of each calibration corpus, which improves average accuracy on zero-shot benchmarks and reduces perplexity degradation, especially under aggressive pruning budgets.

Sparse Mixture-of-Experts (MoE) language models contain considerable redundancy among their routed experts, yet pruning them effectively without downstream calibration data remains a challenge. Existing pruning methods often rely on a single aggregated importance score, which can bias the retained experts towards whatever patterns dominate the calibration data.

Generic TB-Coverage addresses this. Instead of collapsing expert utility into a single score, it profiles each expert's utility separately on multiple generic text corpora, such as WikiText2 and C4. It then enforces a fixed-budget coverage rule, which ensures that the high-utility experts of each corpus are preserved before the final pruning mask is constructed. Because every corpus is guaranteed representation, no single corpus can dominate the selection.

The method was evaluated on Qwen1.5-MoE-A2.7B and DeepSeek-MoE-16B-Base at expert-retention budgets of 25%, 50%, and 75%. Generic TB-Coverage improved average accuracy on six common zero-shot benchmarks and reduced perplexity degradation on the calibration corpora, with the largest gains under aggressive pruning. This indicates that preserving cross-corpus expert coverage is an effective prior for MoE pruning with generic data.
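The profile-then-cover idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `coverage_aware_mask`, the `per_corpus_quota` parameter, the round-robin reservation order, and the normalised-mean rule used to fill leftover slots are all hypothetical choices made here. The summary does not specify the paper's exact coverage rule or utility metric.

```python
import numpy as np

def coverage_aware_mask(utility, budget, per_corpus_quota):
    """Return a boolean keep-mask over the experts of one MoE layer.

    utility: array of shape (num_corpora, num_experts); higher = more useful.
    budget: total number of experts to keep (fixed).
    per_corpus_quota: number of top experts reserved per corpus before the
        remaining slots are filled (hypothetical parameter, an assumption).
    """
    utility = np.asarray(utility, dtype=float)
    num_corpora, num_experts = utility.shape
    if not 0 < budget <= num_experts:
        raise ValueError("budget must be between 1 and num_experts")

    keep = np.zeros(num_experts, dtype=bool)
    # Per-corpus rankings, best expert first (stable sort for reproducible ties).
    rankings = np.argsort(-utility, axis=1, kind="stable")

    # Phase 1 (coverage): visit rank positions round-robin across corpora so
    # that no corpus uses up the budget before the others are represented.
    for rank in range(min(per_corpus_quota, num_experts)):
        for corpus in range(num_corpora):
            if keep.sum() >= budget:
                break
            keep[rankings[corpus, rank]] = True

    # Phase 2 (fill): spend leftover slots on the best corpus-normalised
    # average utility. This fill rule is an assumption of this sketch.
    totals = utility.sum(axis=1, keepdims=True)
    normalised = utility / np.where(totals > 0, totals, 1.0)
    aggregate = normalised.mean(axis=0)
    for expert in np.argsort(-aggregate, kind="stable"):
        if keep.sum() >= budget:
            break
        keep[expert] = True
    return keep

# Toy example: 2 corpora, 6 experts, keep 3 experts (50% retention).
toy_utility = [
    [0.9, 0.8, 0.1, 0.0, 0.05, 0.02],  # made-up utilities on corpus A
    [0.1, 0.0, 0.2, 0.9, 0.05, 0.6],   # made-up utilities on corpus B
]
toy_mask = coverage_aware_mask(toy_utility, budget=3, per_corpus_quota=1)
print(np.flatnonzero(toy_mask))
```

With a quota of 1 and a budget of 3, the top expert of each corpus is reserved first, and the single remaining slot goes to the best aggregate scorer. The utilities in the toy example are invented for illustration only.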

Why it matters

For professionals working with large language models, particularly MoE architectures, this research offers a way to compress models without task-specific calibration data. By retaining experts that matter across several corpora, it loses less accuracy at a given pruning budget, which makes these models cheaper to deploy and serve.

How to implement this in your domain

  1. Investigate Generic TB-Coverage for pruning Sparse MoE models to optimize deployment size and inference costs.
  2. Apply coverage-aware pruning methods using diverse generic text corpora for model calibration.
  3. Benchmark the performance of pruned MoE models on zero-shot tasks to validate accuracy improvements.
  4. Develop internal tools to profile per-expert utility across different datasets for more informed pruning decisions.
  5. Consider aggressive pruning strategies for MoE models to maximize efficiency while maintaining performance.
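As a starting point for the per-expert profiling tool in step 4, the sketch below computes one utility row per corpus from router outputs. It assumes utility can be approximated by the routing weight an expert receives on tokens that select it; this is a common proxy chosen here for illustration, not necessarily the metric used in the paper. The names `expert_utility`, `profile_corpora`, and `fake_router` are hypothetical, and the router outputs are synthetic stand-ins for values you would capture from a real model.

```python
import numpy as np

def expert_utility(router_probs, top_k):
    """Sum the routing weight each expert receives over tokens that select it.

    router_probs: (num_tokens, num_experts) softmax outputs of one MoE router.
    top_k: number of experts activated per token.
    """
    router_probs = np.asarray(router_probs, dtype=float)
    # Indices of the top_k experts per token (order inside the top_k is irrelevant).
    top_idx = np.argpartition(-router_probs, top_k - 1, axis=1)[:, :top_k]
    selected = np.zeros_like(router_probs, dtype=bool)
    np.put_along_axis(selected, top_idx, True, axis=1)
    return (router_probs * selected).sum(axis=0)

def profile_corpora(router_probs_by_corpus, top_k):
    """Stack one utility row per corpus into a (num_corpora, num_experts) matrix."""
    names = list(router_probs_by_corpus)
    rows = [expert_utility(router_probs_by_corpus[name], top_k) for name in names]
    return names, np.stack(rows)

# Synthetic router outputs standing in for real calibration runs (assumption).
rng = np.random.default_rng(0)

def fake_router(num_tokens, num_experts):
    logits = rng.normal(size=(num_tokens, num_experts))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

corpus_names, utility_matrix = profile_corpora(
    {"wikitext2": fake_router(128, 8), "c4": fake_router(128, 8)},
    top_k=2,
)
print(corpus_names, utility_matrix.shape)
```

Keeping the profile as a matrix with one row per corpus, rather than summing the rows immediately, is what lets a later selection step reason about each corpus separately.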

Who benefits

AI/ML Development · Cloud Computing · Edge AI · Telecommunications · Data Centers

Key takeaways

  • Generic TB-Coverage improves pruning of Sparse MoE language models.
  • It uses generic text corpora for calibration, avoiding downstream data bias.
  • The method preserves high-utility experts from diverse corpora.
  • It boosts accuracy and reduces perplexity degradation, especially with aggressive pruning.

Original post by Yongqin Zeng, Sicheng Pan, Jiale Wang, Hai-tao Zheng, Hong-Gee Kim, Chunxia Ma, XiuTeng Zhou

"arXiv:2607.01710v1 Announce Type: new Abstract: Sparsely activated Mixture-of-Experts (MoE) language models contain substantial structured redundancy among routed experts, but pruning them without downstream calibration data remains challenging. Existing expert-pruning methods ty…"
