New Residual Scaling Improves Looped Transformer Stability and Transferability

Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li · June 18, 2026

Summary

Researchers found that looped (weight-tied) Transformers require a stronger residual scaling of `1/N` for stability, where `N` is the loop count, unlike the `1/√L` for standard residual networks. They propose a factored parameterization `λ/(N√L)` that enhances trainability and allows hyperparameter transfer.

Looped Transformers, which reuse a single residual block multiple times to increase effective depth without adding parameters, present unique challenges for stability. While traditional depth-scaling analyses for residual networks suggest a scaling factor of `1/√L` for a depth-`L` network, this research demonstrates that this is insufficient for looped architectures. The study reveals that weight sharing in looped Transformers causes residual updates to be correlated across iterations, necessitating a stronger scaling factor of `1/N`, where `N` is the number of loops.

For multi-layer blocks, a new factored parameterization `λ/(N√L)` is derived. This parameterization separates the two sources of growth: `1/N` for within-layer loop correlation and `1/√L` for across-layer variance. A significant implication of this finding is that the optimal learning rate for these models depends solely on the number of unique layers `L`, not on the loop count `N`. This allows hyperparameters to be transferred directly from smaller to larger `N` configurations without extensive retuning. Experimental results on looped Transformers confirm that `1/N` scaling improves trainability and reaches a better loss than `1/√N` scaling across various loop counts.
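The two scaling factors can be motivated with a first-order heuristic. This is an illustrative reading of the summary, not the paper's derivation; it assumes the step size $\varepsilon$ is small enough that $f(h_k) \approx f(h_0)$ across loops, and that updates from distinct layers, written here as hypothetical $u_\ell$ with typical norm $\sigma$, are roughly uncorrelated and zero-mean.

$$
h_N = h_0 + \varepsilon \sum_{k=0}^{N-1} f(h_k) \approx h_0 + \varepsilon N f(h_0)
\quad\Longrightarrow\quad
\|h_N - h_0\| \approx \varepsilon N \,\|f(h_0)\|
$$

$$
\Big\| \varepsilon \sum_{\ell=1}^{L} u_\ell \Big\| \approx \varepsilon \sqrt{L}\,\sigma
$$

Under these assumptions, correlated updates from a shared block add linearly in $N$, while uncorrelated updates from distinct layers add like $\sqrt{L}$. Keeping the total update of order $\lambda$ then suggests $\varepsilon N \sqrt{L} \approx \lambda$, that is, $\varepsilon = \lambda/(N\sqrt{L})$.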

Why it matters

Optimizing the stability and transferability of looped Transformers is crucial for developing more efficient and scalable deep learning models. This research provides practical guidelines for architects and engineers to design and train deeper, more performant models with fewer parameters and reduced hyperparameter tuning effort.

How to implement this in your domain

  1. Adopt the `1/N` residual scaling or the factored `λ/(N√L)` parameterization when designing or implementing looped Transformer architectures.
  2. Experiment with transferring hyperparameters, especially learning rates, directly from smaller to larger looped Transformer models.
  3. Analyze the stability and training dynamics of existing looped models to identify potential issues related to incorrect residual scaling.
  4. Incorporate these scaling principles into custom residual block designs for improved model stability and performance.
  5. Educate engineering teams on the specific scaling requirements for weight-tied architectures to avoid common training pitfalls.
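As a starting point for the first step, here is a minimal NumPy sketch of a looped forward pass that uses the factored multiplier `λ/(N√L)`. The names `residual_scale`, `make_layers`, and `looped_forward` are hypothetical, and the `tanh` layer is a toy stand-in for a real Transformer sublayer; none of this is taken from the paper's code.

```python
import numpy as np

def residual_scale(n_loops: int, n_layers: int, lam: float = 1.0) -> float:
    """Residual multiplier lam / (N * sqrt(L)) for N loops over L unique layers."""
    return lam / (n_loops * np.sqrt(n_layers))

def make_layers(n_layers: int, dim: int, seed: int = 0) -> list:
    """Toy stand-in for L unique layers: one random weight matrix per layer."""
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim))
            for _ in range(n_layers)]

def looped_forward(h: np.ndarray, layers: list, n_loops: int, lam: float = 1.0) -> np.ndarray:
    """Apply the same block of layers n_loops times with scaled residual updates."""
    eps = residual_scale(n_loops, len(layers), lam)
    for _ in range(n_loops):              # the same weights are reused on every loop
        for w in layers:                  # L unique layers inside the shared block
            h = h + eps * np.tanh(h @ w)  # toy f(h); assumption, not a real sublayer
    return h

layers = make_layers(n_layers=4, dim=32)
h0 = np.random.default_rng(1).normal(size=(8, 32))
out_small = looped_forward(h0, layers, n_loops=2)   # small-N configuration
out_large = looped_forward(h0, layers, n_loops=16)  # same layers, more loops
print(np.linalg.norm(out_small), np.linalg.norm(out_large))
```

Because the multiplier shrinks as `1/N`, the output norm in this toy setup should stay on a similar scale when only `n_loops` changes, which is the property the summary links to reusing a learning rate across loop counts.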

Who benefits

AI Research, Machine Learning Engineering, Natural Language Processing, Computer Vision, Cloud Computing

Key takeaways

  • Looped Transformers require a stronger `1/N` residual scaling for stability due to correlated updates.
  • A new `λ/(N√L)` parameterization separates within-layer and across-layer growth factors.
  • Optimal learning rates depend only on unique layers `L`, enabling hyperparameter transfer.
  • Correct scaling significantly improves trainability and loss in looped Transformer models.
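The first takeaway can be checked on a toy problem. The sketch below is an assumption-laden illustration, not the paper's experiment: it replaces the Transformer block with a random linear map `f(h) = W h`, and the function name `growth` is hypothetical. It compares how much the hidden-state norm grows with untied weights at `1/√N`, tied weights at `1/√N`, and tied weights at `1/N`.

```python
import numpy as np

def growth(n_loops: int, eps: float, shared: bool, dim: int = 64, seed: int = 0) -> float:
    """Return ||h_N|| / ||h_0|| after n_loops steps of h <- h + eps * (W @ h)."""
    rng = np.random.default_rng(seed)

    def draw() -> np.ndarray:
        # Gaussian weights with variance 1/dim, so W @ h has roughly the norm of h
        return rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim))

    h = rng.normal(size=dim)
    h0_norm = np.linalg.norm(h)
    w = draw()
    for _ in range(n_loops):
        h = h + eps * (w @ h)
        if not shared:
            w = draw()  # untied case: fresh weights for the next step
    return float(np.linalg.norm(h) / h0_norm)

for n in (16, 64, 256):
    untied_sqrt = growth(n, 1.0 / np.sqrt(n), shared=False)
    tied_sqrt = growth(n, 1.0 / np.sqrt(n), shared=True)
    tied_inv = growth(n, 1.0 / n, shared=True)
    print(f"N={n:4d}  untied 1/sqrt(N): {untied_sqrt:10.2f}  "
          f"tied 1/sqrt(N): {tied_sqrt:12.2f}  tied 1/N: {tied_inv:6.2f}")
```

In this linear toy, the tied run with `1/√N` should grow rapidly as `N` increases, while the tied run with `1/N` and the untied run with `1/√N` should stay on a similar scale, which matches the correlated-update explanation given in the summary.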

Original post by Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li

"arXiv:2606.18524v1 Announce Type: new Abstract: Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\…"

