Muon^p Optimizer Enhances Finetuning with Fractional Spectral Powers

Yihe Dong, Will Sawin · June 15, 2026

Summary

Muon^p is a new optimizer that generalizes the Muon optimizer by using fractional spectral power updates, interpolating between Muon and gradient descent. It improves validation perplexity and downstream task performance, especially for finetuning billion-scale models, by selectively preserving singular-value information.

Muon, an increasingly widely used optimizer, replaces a gradient $G = USV^\top$ with its polar factor $UV^\top$, which flattens the singular spectrum. Full flattening discards singular-value information that may matter for model adaptation. Muon^p extends Muon with fractional spectral-power updates $US^pV^\top$ for rational $p \in (0,1)$. This interpolates between Muon's full flattening ($p \to 0$) and ordinary gradient descent ($p = 1$), giving finer control over the singular spectrum.

To make Muon^p practical, the authors prove that fractional spectral powers cannot be computed using fixed univariate polynomial iterations. Instead, they derive low-degree odd bivariate recurrences that approximate $US^pV^\top$ using only matrix multiplications. This preserves Muon's computational structure and complexity, so the new optimizer remains efficient to implement.

The theoretical analysis shows that Muon^p maximizes the linear (first-order) improvement in loss subject to a Schatten $q$-norm constraint on the update, where $q = 1 + \frac{1}{p}$.

Empirically, Muon^p is particularly effective for finetuning models with billions of parameters, where it consistently improves validation perplexity and downstream task performance. The study also uses spectral geometry to explain the settings in which Muon^p is less suitable. Together, these findings highlight the value of selectively preserving singular-spectrum information and offer a principled way to do so.
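
To make the update $US^pV^\top$ concrete, here is a minimal NumPy sketch that computes it through an explicit SVD. This is a reference computation only and is not the authors' method: the paper approximates the same map with matrix-multiplication-only recurrences, which are not reproduced here. The function name `spectral_power` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def spectral_power(G, p):
    """Return U @ diag(S**p) @ V^T, where G = U @ diag(S) @ V^T is the thin SVD.

    Illustrative reference implementation (hypothetical helper name).
    It uses an explicit SVD, not the paper's matmul-only recurrences.
    """
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    # Scale the columns of U by S**p, then recombine with V^T.
    return (U * S**p) @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))  # stand-in for a gradient matrix

# p = 0 gives the polar factor (Muon), p = 1 returns G itself (gradient descent).
for p in (0.0, 0.5, 1.0):
    singular_values = np.linalg.svd(spectral_power(G, p), compute_uv=False)
    print(f"p={p}: singular values of the update = {np.round(singular_values, 3)}")
```

At $p = 0$ every singular value of the update is 1, at $p = 1$ the update is the gradient itself, and intermediate $p$ compresses the spectrum without erasing the ordering of the singular values.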

Why it matters

For AI engineers and researchers working with large-scale models, especially during finetuning, Muon^p offers a principled and empirically validated way to improve results. The reported gains are in validation perplexity and downstream task performance.

How to implement this in your domain

  1. Evaluate current optimizers: Benchmark the performance of your current optimizers, especially during the finetuning phase of large models.
  2. Experiment with Muon^p: Integrate and test Muon^p as an alternative optimizer for finetuning billion-scale models.
  3. Analyze spectral geometry: Use spectral geometry insights to understand when Muon^p might be most beneficial for your specific model architectures and tasks.
  4. Optimize finetuning strategies: Incorporate fractional spectral power updates to improve validation perplexity and downstream task performance.
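
As a starting point for the benchmarking and experimentation steps above, the sketch below sweeps $p$ on a small synthetic least-squares problem. Everything in it is an assumption made for illustration: the toy problem, the helper names `spectral_power` and `toy_run`, the shared learning rate, and the SVD-based update (the paper uses matmul-only recurrences instead). Momentum and any other details of a production optimizer are omitted. It says nothing about billion-scale finetuning; it only shows the shape of a sweep.

```python
import numpy as np

def spectral_power(G, p):
    # Reference SVD route to U S^p V^T (hypothetical helper, not the paper's recurrence).
    U, S, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * S**p) @ Vt

def toy_run(p, lr=0.1, steps=300, seed=0):
    """Fit W on a synthetic least-squares problem with U S^p V^T updates.

    Returns (initial_loss, final_loss). A toy stand-in for a finetuning run.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((64, 16))         # synthetic inputs
    W_target = rng.standard_normal((16, 8))   # weights to recover
    Y = X @ W_target
    W = np.zeros((16, 8))
    n = X.shape[0]

    def loss(W):
        return 0.5 * np.sum((X @ W - Y) ** 2) / n

    initial = loss(W)
    for _ in range(steps):
        G = X.T @ (X @ W - Y) / n              # gradient of the loss above
        W = W - lr * spectral_power(G, p)      # spectral-power step
    return initial, loss(W)

# p = 0.0 is the Muon-style endpoint, p = 1.0 is plain gradient descent.
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    initial, final = toy_run(p)
    print(f"p={p:.2f}  initial loss={initial:.3f}  final loss={final:.3e}")
```

A fair comparison would tune the learning rate separately for each $p$, since the scale of the update changes with $p$, and would track validation metrics rather than training loss.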

Who benefits

AI Development, Machine Learning Research, Software Engineering, Cloud Computing, Data Science

Key takeaways

  • Muon^p generalizes the Muon optimizer using fractional spectral powers.
  • It interpolates between Muon and gradient descent, preserving singular-value information.
  • Muon^p consistently improves validation perplexity and downstream task performance when finetuning billion-scale models.
  • The method offers a principled way to achieve gains by selectively managing the singular spectrum.
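
The link between the exponent $p$ and the Schatten $q$-norm with $q = 1 + \frac{1}{p}$ can be recovered from a standard duality argument. The derivation below is a sketch of that textbook argument, not a reproduction of the paper's proof. The symbols $\Delta$ (the update), $\sigma_i$ (singular values of $G$), $x_i$ (singular values of $\Delta$), and $q^*$ (the conjugate exponent) are notation introduced here.

```latex
% Setup: G = U S V^T with singular values \sigma_i > 0, step W -> W - \Delta.
% The first-order decrease in loss is <G, \Delta>, so we solve
\[
  \max_{\Delta} \; \langle G, \Delta \rangle
  \quad \text{subject to} \quad
  \|\Delta\|_{S_q} = \Big( \sum_i x_i^{\,q} \Big)^{1/q} \le 1 .
\]
% By von Neumann's trace inequality the maximizer shares singular vectors with G,
% \Delta = U \,\mathrm{diag}(x)\, V^T with x_i \ge 0, which reduces the problem to
\[
  \max_{x \ge 0} \; \sum_i \sigma_i x_i
  \quad \text{subject to} \quad \sum_i x_i^{\,q} \le 1 .
\]
% Hoelder's inequality with 1/q + 1/q^* = 1 is tight exactly when
\[
  x_i \;\propto\; \sigma_i^{\,q^* - 1} \;=\; \sigma_i^{\,1/(q-1)} ,
  \qquad\text{so}\qquad
  \Delta \;\propto\; U S^{1/(q-1)} V^\top .
\]
% Matching exponents with U S^p V^T:
\[
  p = \frac{1}{q-1}
  \quad\Longleftrightarrow\quad
  q = 1 + \frac{1}{p} .
\]
```

The endpoints agree with the interpolation picture: $p = 1$ gives $q = 2$ (the Frobenius norm, whose steepest-descent direction is the gradient itself), and $p \to 0$ gives $q \to \infty$ (the spectral norm, whose steepest-descent direction is the polar factor used by Muon).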

Original post by Yihe Dong, Will Sawin

"arXiv:2606.13867v1 Announce Type: new Abstract: Muon is an increasingly widely used optimizer that replaces a gradient $G=USV^\top$ with its polar factor $UV^\top$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may mat…"

Originally posted on X.
