Analyzing Transformer Weight Distribution Dynamics Under AdamW Training

Tiexin Ding · June 19, 2026

Summary

This research examines how the Weibull weight-scale parameter evolves during AdamW training of transformers by decomposing the evolution of the squared weight norm into alignment, injection, and decay forces. It finds that alignment dominates the initial growth phase and introduces a spline displacement method that recovers the alignment force from sparse checkpoints.

The study examines how transformer weight distributions evolve during AdamW training, focusing on the Weibull weight-scale parameter $\lambda$, which grows, overshoots, and then relaxes over the course of training. To explain this behavior, the researchers derive a leading-order three-force decomposition of the squared weight norm from the AdamW update rule. The three forces are an alignment force, which measures the correlation between the weights and the adaptive update direction; an injection force, which comes from the adaptive step magnitude; and a decay force, which results from decoupled weight decay.

Experiments on self-trained Pythia-70M models with ground-truth optimizer moments show that the alignment force is the primary driver during the initial rise of $\lambda$, contributing 88-94% of the absolute force budget across random seeds. As training approaches saturation, the alignment and decay forces balance, which explains the transition from weight-scale growth to relaxation.

To extend the analysis to real-world models, where optimizer moments are often unavailable, the paper introduces a spline displacement method. It recovers the alignment force from sparsely sampled checkpoints with approximately 92-94% accuracy, significantly outperforming a naive two-point baseline. The study also notes that the peak value of $\lambda(t)$ appears to correlate with the coherence of the training data, which suggests a data-dependent component to weight-scale growth that is slated for further investigation.
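The three-force decomposition described above can be reproduced at leading order from the standard AdamW update. The notation here is an assumption, not taken from the paper: $\eta$ is the learning rate, $\gamma$ is the weight-decay coefficient (chosen to avoid a clash with the Weibull $\lambda$), and $u_t$ is the adaptive update direction built from the bias-corrected moments $\hat m_t$ and $\hat v_t$. The paper's exact normalization and force definitions may differ.

$$
w_{t+1} = w_t - \eta\,(u_t + \gamma\, w_t), \qquad u_t = \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}
$$

$$
\|w_{t+1}\|^2 - \|w_t\|^2
= \underbrace{-2\eta\,\langle w_t, u_t\rangle}_{\text{alignment}}
+ \underbrace{\eta^2\,\|u_t\|^2}_{\text{injection}}
\;\underbrace{-\,2\eta\gamma\,\|w_t\|^2}_{\text{decay}}
+ \eta^2\left(2\gamma\,\langle w_t, u_t\rangle + \gamma^2\,\|w_t\|^2\right)
$$

The last group is of order $\eta^2\gamma$ and is the part a leading-order treatment would drop. Under this reading, the reported balance of alignment and decay near saturation corresponds to $-\langle w_t, u_t\rangle \approx \gamma\,\|w_t\|^2$, with injection as a small positive correction.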

Why it matters

Understanding the dynamics of weight distributions during training is crucial for optimizing transformer performance, improving training stability, and potentially designing more efficient and robust deep learning architectures.

How to implement this in your domain

  1. Monitor weight distribution parameters, like the Weibull scale parameter, during transformer training.
  2. Analyze the contributions of alignment, injection, and decay forces to weight norm changes in custom training loops.
  3. Utilize the spline displacement method to estimate alignment forces from sparse training checkpoints.
  4. Investigate the impact of training data coherence on weight-scale growth and model stability.
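The first two steps above could be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's code: `fit_weibull` and `adamw_forces` are hypothetical helper names, the Weibull fit is assumed to apply to weight magnitudes $|w|$, and the force definitions follow the textbook AdamW update rather than any formula confirmed by the source.

```python
import numpy as np


def fit_weibull(abs_weights, iters=80):
    """Maximum-likelihood fit of a two-parameter Weibull (shape k, scale lam).

    Assumption: the fit is applied to positive weight magnitudes |w|.
    """
    x = np.asarray(abs_weights, dtype=np.float64).ravel()
    log_x = np.log(x[x > 0])
    shift = log_x.max()            # rescale so exp() cannot overflow
    mean_log = log_x.mean()

    def score(k):
        # Increasing in k; its root is the MLE of the shape parameter.
        xk = np.exp(k * (log_x - shift))
        return (xk * log_x).sum() / xk.sum() - 1.0 / k - mean_log

    lo, hi = 1e-3, 100.0           # bracket assumed wide enough for weight data
    for _ in range(iters):         # plain bisection on the shape parameter
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    k = 0.5 * (lo + hi)
    # lam = mean(x**k) ** (1 / k), evaluated in log space for stability
    lam = np.exp(shift + np.log(np.mean(np.exp(k * (log_x - shift)))) / k)
    return k, lam


def adamw_forces(w, m_hat, v_hat, lr, weight_decay, eps=1e-8):
    """Leading-order contributions to ||w||^2 from one AdamW step.

    Assumes the update w <- w - lr * (u + weight_decay * w),
    with u = m_hat / (sqrt(v_hat) + eps) built from bias-corrected moments.
    """
    u = m_hat / (np.sqrt(v_hat) + eps)
    alignment = -2.0 * lr * np.dot(w, u)                 # weight/update correlation
    injection = lr ** 2 * np.dot(u, u)                   # adaptive step magnitude
    decay = -2.0 * lr * weight_decay * np.dot(w, w)      # decoupled weight decay
    return alignment, injection, decay


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, lam = fit_weibull(0.02 * rng.weibull(1.5, size=200_000))
    print(f"shape k = {k:.3f}, scale lambda = {lam:.5f}")
```

In a PyTorch training loop, `m_hat` and `v_hat` would presumably be derived from the optimizer state entries `exp_avg` and `exp_avg_sq` after bias correction, flattened per parameter tensor. Logging the three terms alongside the fitted scale at each checkpoint gives a time series that can be compared against the growth, overshoot, and relaxation phases.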

Who benefits

AI Research, Deep Learning Engineering, Cloud Computing, Autonomous Systems, Scientific Computing

Key takeaways

  • Transformer weight distributions exhibit predictable growth and relaxation patterns during AdamW training.
  • The "alignment force" is the dominant factor driving initial weight-scale growth.
  • A balance between alignment and decay forces explains weight-scale saturation.
  • A spline displacement method allows accurate force estimation from sparse checkpoints.
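The summary does not describe how the spline displacement method works, so the following is only one plausible reading, not the paper's algorithm. It assumes a cubic spline is fitted through the checkpointed weights over training steps, that the spline derivative stands in for the average per-step displacement, and that the learning rate and weight decay are roughly constant between checkpoints. The function names are hypothetical, and the identity used for the alignment term comes from the textbook AdamW update $w_{t+1} - w_t = -\eta\,(u_t + \gamma\, w_t)$.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def spline_alignment(steps, checkpoints, lr, weight_decay):
    """Estimate the per-step alignment term at each checkpoint.

    steps:       (T,) strictly increasing training-step indices
    checkpoints: (T, D) flattened weights saved at those steps
    Assumption: the spline derivative approximates the per-step displacement.
    """
    steps = np.asarray(steps, dtype=np.float64)
    W = np.asarray(checkpoints, dtype=np.float64)
    spline = CubicSpline(steps, W, axis=0)
    dW = spline(steps, nu=1)                  # d w / d step at each checkpoint
    # From dw = -lr * (u + wd * w):  -lr * u = dw + lr * wd * w
    neg_lr_u = dW + lr * weight_decay * W
    return 2.0 * np.sum(W * neg_lr_u, axis=1)  # equals -2 * lr * <w, u>


def two_point_alignment(steps, checkpoints, lr, weight_decay):
    """Baseline: forward difference between consecutive checkpoints."""
    steps = np.asarray(steps, dtype=np.float64)
    W = np.asarray(checkpoints, dtype=np.float64)
    dW = np.diff(W, axis=0) / np.diff(steps)[:, None]
    return 2.0 * np.sum(W[:-1] * (dW + lr * weight_decay * W[:-1]), axis=1)


if __name__ == "__main__":
    # Synthetic smooth trajectory, used only to exercise the two estimators.
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=32), rng.normal(size=32)
    steps = np.arange(0, 2001, 100)
    t = steps[:, None].astype(float)
    W = a * np.sin(t / 200.0) + b * np.cos(t / 300.0)
    print(spline_alignment(steps, W, lr=1e-3, weight_decay=0.1)[:3])
    print(two_point_alignment(steps, W, lr=1e-3, weight_decay=0.1)[:3])
```

The two-point baseline attributes the whole gap between checkpoints to a straight line, while the spline uses neighboring checkpoints to account for curvature in the trajectory. Real optimizer trajectories are noisier than the synthetic example, so any accuracy comparison should be made against a run where the true optimizer moments were logged.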

Original post by Tiexin Ding

"arXiv:2606.19367v1 Announce Type: new Abstract: Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $\lambda$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-…"

Originally posted by Tiexin Ding on X
