Aurora Optimizer Improves Wide MLP Layer Training Efficiency

Alec Dewulf, Dhruv Pai, Li Yang, Ashley Zhang, Ben Keigwin · June 29, 2026

Summary

Aurora, a new spectral optimizer, addresses non-uniform row norms in matrix parameter updates, a problem that hinders wide MLP layer training. By enforcing row-uniformity while preserving the polar factor geometry of the momentum matrix, Aurora outperforms Muon in pre-training experiments and, combined with other state-of-the-art methods, achieves the top results among spectral optimizers on the modded-nanoGPT speedrun.

Wide Multi-Layer Perceptron (MLP) layers can be hard to train with existing optimizers such as Muon. For tall matrix parameters, such as the projection matrices in MLP layers, the Muon update can have arbitrarily non-uniform row norms. This can set up a self-reinforcing feedback loop in which some neurons receive persistently small updates and become ineffective. Row normalization can mitigate the problem, but current techniques often distort the desired update geometry.

Aurora is a spectral optimizer designed to enforce row-uniformity in matrix parameter updates without compromising the polar factor geometry of the momentum matrix, which is crucial for effective learning. It outperforms Muon in pre-training experiments and, when combined with other state-of-the-art methods, achieves the top results among spectral optimizers on the modded-nanoGPT speedrun. Its empirical gains scale with the MLP expansion factor, suggesting it could enable more effective training of very wide MLP layers.
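The row-norm problem described above can be reproduced on a toy matrix. The sketch below is neither Aurora nor Muon's actual implementation. It approximates the polar factor with the classic cubic Newton-Schulz iteration in plain Python, an assumption chosen only to keep the example dependency-free. The 3x2 `momentum` matrix, the helper names, and the `sqrt(n/m)` row target are all hypothetical. It illustrates two points: the polar factor has orthonormal columns yet very uneven row norms, and naive row normalization evens out the rows at the cost of column orthogonality.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def polar_factor(m, steps=30):
    """Approximate the polar factor of a full-column-rank matrix.

    Illustrative only: cubic Newton-Schulz, X <- 1.5 X - 0.5 X (X^T X),
    started from the Frobenius-normalized input.
    """
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]  # singular values now in (0, 1]
    for _ in range(steps):
        xtx = matmul(transpose(x), x)
        xxtx = matmul(x, xtx)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, xxtx)]
    return x


def row_norms(a):
    """Euclidean norm of each row."""
    return [math.sqrt(sum(v * v for v in row)) for row in a]


# Hypothetical tall (3x2) momentum matrix: two strong rows and one weak row.
momentum = [[1.0, 0.0], [0.0, 1.0], [0.01, 0.01]]

update = polar_factor(momentum)
norms = row_norms(update)
gram = matmul(transpose(update), update)  # close to identity: orthonormal columns

# Naive fix (an assumption, not Aurora): rescale every row to norm sqrt(n/m).
target = math.sqrt(2 / 3)
normalized = [[v * target / nrm for v in row] for row, nrm in zip(update, norms)]
gram_normalized = matmul(transpose(normalized), normalized)

print("row norms of polar factor:", norms)
print("off-diagonal Gram entry before:", gram[0][1])
print("off-diagonal Gram entry after row normalization:", gram_normalized[0][1])
```

In this toy case the weak row's update is roughly 70 times smaller than the strong rows', and after naive row normalization the off-diagonal Gram entry moves from about 0 to about 1/3, so the columns are no longer orthogonal.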

Why it matters

AI engineers and researchers can use Aurora to train networks with wide MLP layers more efficiently. Because its gains over Muon scale with the MLP expansion factor, it may make very wide MLP layers more practical to train.

How to implement this in your domain

  1. Review current optimizer choices for training large neural networks, especially those with wide MLP layers.
  2. Experiment with integrating the Aurora optimizer into existing deep learning frameworks.
  3. Benchmark Aurora's performance against other spectral optimizers on internal models and datasets.
  4. Consider designing models with wider MLP layers, leveraging Aurora's ability to train them effectively.

Who benefits

AI/ML Development · Cloud Computing · Research & Academia · Data Science

Key takeaways

  • Non-uniform row norms in Muon's updates hinder wide MLP layer training.
  • Aurora is a new spectral optimizer that enforces row-uniformity.
  • It preserves the polar factor geometry of the momentum matrix and outperforms Muon in pre-training experiments.
  • Its gains scale with the MLP expansion factor, suggesting it could enable more effective training of very wide MLP layers.

Original post by Alec Dewulf, Dhruv Pai, Li Yang, Ashley Zhang, Ben Keigwin

"arXiv:2606.27715v1 Announce Type: new Abstract: We show that for tall matrix parameters, like projection matrices in the MLP layers, the Muon update can have row norms that are arbitrarily non-uniform. This can lead to a self-reinforcing feedback loop whereby neurons receive pers…"
