Schatten-p Norms: Optimal Use in Deep Learning Depends on Regime

Thomas Pethick · June 16, 2026

Summary

Research clarifies the optimal application of Schatten-p norms in deep learning, showing that their benefits are regime-dependent. While Schatten-infinity optimizers like Muon perform well, smaller Schatten-p geometries can be superior in low-dimensional settings, including those relevant to Chinchilla scaling.

This paper investigates how best to use Schatten-p norms in deep learning optimization and addresses seemingly conflicting observations about whether they help. The findings indicate that the right choice of norm depends on the training regime. Optimizers based on the Schatten-infinity norm, such as Muon, have shown strong empirical results, but the study finds that geometries with smaller p can be more advantageous in low-dimensional regimes, a category that includes settings relevant to Chinchilla scaling. The conclusion stems from a new analysis demonstrating noise-robust acceleration within the SODA framework for p greater than 2. The same analysis explains why Muon-like methods do not require warm-up and naturally favor larger batch sizes, and it provides a batch size scaling rule applicable to any value of p.
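For readers unfamiliar with the term, the Schatten-p norm of a matrix is the ordinary vector p-norm applied to its singular values: p = 1 gives the nuclear norm, p = 2 the Frobenius norm, and p = infinity the spectral norm. The snippet below is a minimal NumPy sketch of that standard definition only. The function name `schatten_norm` is chosen here for illustration and does not come from the paper.

```python
import numpy as np


def schatten_norm(matrix, p):
    """Schatten-p norm: the vector p-norm of the singular values (p >= 1)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    if np.isinf(p):
        # Schatten-infinity is the largest singular value (spectral norm).
        return float(singular_values.max())
    return float(np.sum(singular_values ** p) ** (1.0 / p))


if __name__ == "__main__":
    example = np.diag([3.0, 4.0])  # singular values are 4 and 3
    print(schatten_norm(example, 1))       # nuclear norm: 7.0
    print(schatten_norm(example, 2))       # Frobenius norm: 5.0
    print(schatten_norm(example, np.inf))  # spectral norm: 4.0
```

As p grows, the norm pays less attention to the small singular values and more to the largest one, which is the sense in which different values of p define different geometries.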

Why it matters

Deep learning practitioners and researchers can gain a clearer understanding of how to select appropriate optimization techniques, potentially leading to more efficient training, better model performance, and optimized resource utilization, especially when dealing with different model scales and data dimensions.

How to implement this in your domain

  1. Evaluate the dimensionality of your deep learning models and datasets to determine the most suitable Schatten-p norm for optimization.
  2. Experiment with Schatten-p based optimizers, considering smaller values of p for low-dimensional regimes and larger values of p for high-dimensional scenarios.
  3. Adjust batch sizes according to the newly proposed scaling rule for Schatten-p norms to optimize training efficiency.
  4. Investigate the implications of these findings for specific model architectures and scaling laws, such as Chinchilla scaling.
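To make step 2 concrete, one way to experiment is to compute the update direction of unit Schatten-p norm that is best aligned with a gradient matrix. The sketch below uses the textbook closed form from norm duality: with G = U diag(s) V^T and q the dual exponent (1/p + 1/q = 1), the direction keeps U and V and rescales the singular values to s^(q-1) / ||s||_q^(q-1). It is an illustration under stated assumptions, not the paper's SODA procedure or its batch size rule. The name `schatten_lmo_direction` is hypothetical, and an exact SVD stands in for the cheaper iterative orthogonalization that Muon-style optimizers typically use.

```python
import numpy as np


def schatten_lmo_direction(gradient, p):
    """Unit Schatten-p direction D that maximizes the inner product <gradient, D>.

    Hypothetical helper for illustration; uses an exact SVD (an assumption,
    practical optimizers usually approximate this step).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    u, s, vt = np.linalg.svd(gradient, full_matrices=False)
    if s.size == 0 or s[0] == 0.0:
        # Zero gradient: no preferred direction.
        return np.zeros_like(gradient, dtype=float)
    if np.isinf(p):
        # p = infinity: every singular value becomes 1, giving U V^T.
        weights = np.ones_like(s)
    elif p == 1:
        # p = 1: only the top singular pair is kept (s is sorted descending).
        weights = np.zeros_like(s)
        weights[0] = 1.0
    else:
        q = p / (p - 1.0)  # dual exponent
        weights = s ** (q - 1.0)
        weights = weights / np.sum(s ** q) ** ((q - 1.0) / q)
    # Scale the columns of U by the new singular values, then recombine.
    return (u * weights) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal((4, 3))
    for p in (2, 4, np.inf):
        direction = schatten_lmo_direction(grad, p)
        print(p, np.linalg.svd(direction, compute_uv=False))
```

With p = 2 this reduces to the gradient divided by its Frobenius norm, and with p = infinity to the orthogonalized matrix U V^T. Intermediate values of p interpolate between the two by flattening the singular value spectrum to different degrees.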

Who benefits

AI/ML Development, Cloud Computing, Data Science, Research & Academia

Key takeaways

  • The optimal Schatten-p norm for deep learning optimization is regime-dependent, not universally fixed.
  • Smaller Schatten-p geometries can be superior in low-dimensional settings, including those relevant to Chinchilla scaling.
  • The analysis provides insights into why Schatten-infinity optimizers like Muon don't require warm-up and favor large batches.
  • A new batch size scaling rule that applies to any value of p is introduced, guiding how batch size should be set for a given Schatten-p optimizer.

Original post by Thomas Pethick

"arXiv:2606.15268v1 Announce Type: new Abstract: Schatten-$\infty$ based optimizers such as Muon have shown promising empirical performance, but there remains seemingly conflicting observations regarding whether they are beneficial. We resolve this conflict by showing that the con…"

Originally posted by Thomas Pethick on X
