Weight Norm Dictates Grokking Timescale in Neural Networks

Truong Xuan Khanh, Doan Hoang Viet, Luu Duc Trung, Phan Thanh Duc · June 15, 2026

Summary

This paper investigates whether weight norm causally determines the timescale of "grokking," the delayed onset of generalization in neural networks. By intervening on the weight norm during training, the authors find that under free training grokking occurs when the norm reaches a critical value, and that when the norm is clamped to a fixed multiple of that value, the delay until generalization follows an exponential law. They present this as a causal delay law for grokking.

This research addresses the debate over the role of weight norm in "grokking," a phenomenon in which neural networks generalize long after they have fit their training data. Previous studies disagreed on whether a critical weight norm directly causes this delay. The authors address the dispute by intervening on the weight norm during training rather than merely observing it. Under free training with weight decay, they report that networks grok when their weight norm reaches a specific critical value, and that this value stays consistent across different training conditions. When the weight norm is instead clamped to a fixed multiple of the critical value, the grokking delay follows an exponential law. The authors take this as evidence of a causal link: the weight norm sets the timescale of grokking. The study also notes that LayerNorm can decouple weight scale from network function, which removes this dependence.
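
The clamping intervention described above can be pictured with a minimal sketch. This summary does not say how the authors implement the clamp, so the code below assumes one plausible reading: after each optimizer step, every weight is rescaled so that the global L2 norm equals a chosen multiple of the critical norm. The helper names (`global_l2_norm`, `clamp_to_norm`) and all numeric values are hypothetical, and parameters are plain Python lists rather than framework tensors.

```python
import math

def global_l2_norm(params):
    """L2 norm taken over every weight in every (flattened) parameter list."""
    return math.sqrt(sum(w * w for tensor in params for w in tensor))

def clamp_to_norm(params, target_norm):
    """Rescale all weights in place so the global L2 norm equals target_norm.

    Assumption: a uniform rescaling is used; the paper's procedure may differ.
    """
    current = global_l2_norm(params)
    if current == 0.0:
        return params  # an all-zero network cannot be rescaled
    scale = target_norm / current
    for tensor in params:
        for i in range(len(tensor)):
            tensor[i] *= scale
    return params

# Hypothetical values, for illustration only (not taken from the paper).
critical_norm = 2.0                  # assumed critical norm from a free run
multiple = 1.5                       # assumed clamp multiple
params = [[3.0, 4.0], [0.0, 12.0]]   # toy "network" with global norm 13

clamp_to_norm(params, multiple * critical_norm)
print(global_l2_norm(params))        # approximately multiple * critical_norm
```

In a real training loop, a step like this would run after each parameter update, so that the optimizer changes the direction of the weights while their overall scale stays fixed.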

Why it matters

Understanding what sets the grokking timescale could help practitioners train networks that generalize sooner. If the delay law holds beyond the paper's setting, AI researchers and engineers could use it to design more efficient training regimes and to anticipate when a model will generalize.

How to implement this in your domain

  1. Adjust weight decay and regularization strategies to control weight norm and influence grokking behavior.
  2. Experiment with different LayerNorm placements to decouple weight scale from generalization dynamics.
  3. Develop diagnostic tools to monitor weight norm during training and predict the onset of grokking.
  4. Incorporate insights into training schedules to achieve desired generalization performance more predictably.
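
As a rough illustration of the monitoring idea in step 3, the sketch below scans a logged history of weight norms and reports the first step at which the norm reaches or crosses a given critical value. It assumes the critical value is already known (for example, from an earlier free run), and it accepts a crossing from either direction because this summary does not say whether the norm approaches the critical value from above or below. The function name `first_crossing` and the sample history are hypothetical.

```python
def first_crossing(norm_history, critical_norm):
    """Return the first step at which the logged norm equals or crosses
    critical_norm (from above or below), or None if it never does."""
    for step in range(len(norm_history)):
        diff = norm_history[step] - critical_norm
        if diff == 0.0:
            return step
        if step > 0:
            prev_diff = norm_history[step - 1] - critical_norm
            if prev_diff * diff < 0:  # sign change means the value was crossed
                return step
    return None

# Hypothetical logged norms from a run in which weight decay shrinks the weights.
history = [10.0, 8.0, 6.5, 5.2, 4.1, 3.3]
print(first_crossing(history, critical_norm=5.0))  # prints 4
```

A practical version would log the norm every few hundred steps alongside validation accuracy, so that the predicted onset can be checked against the observed one.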

Who benefits

AI Engineering, AI Research, Machine Learning Development

Key takeaways

  • Weight norm causally determines the grokking timescale in neural networks.
  • Grokking occurs when the weight norm reaches a critical, consistent value.
  • With the weight norm clamped to a fixed multiple of the critical value, the delay until generalization follows an exponential law.
  • LayerNorm can decouple weight scale from network function, altering this dependence.
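
To make the exponential delay law concrete, here is a minimal sketch of how such a law could be fitted to measurements. The summary only says the delay follows "an exponential law", so the specific form assumed here, delay = a * exp(k * ratio) with ratio being the clamp multiple, is an assumption, as are the function name `fit_exponential_delay` and the symbols `a` and `k`. The data are synthetic, generated from a known curve purely to check the fit; they are not results from the paper.

```python
import math

def fit_exponential_delay(ratios, delays):
    """Least-squares fit of log(delay) = log(a) + k * ratio; returns (a, k)."""
    n = len(ratios)
    logs = [math.log(d) for d in delays]
    mean_x = sum(ratios) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ratios)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, logs))
    k = sxy / sxx
    a = math.exp(mean_y - k * mean_x)
    return a, k

# Synthetic data from a known curve, NOT measurements from the paper.
true_a, true_k = 100.0, 2.0
ratios = [1.0, 1.25, 1.5, 2.0]   # hypothetical clamp multiples
delays = [true_a * math.exp(true_k * r) for r in ratios]

a, k = fit_exponential_delay(ratios, delays)
print(a, k)  # should recover values close to true_a and true_k
```

On real runs, a straight line in a plot of log(delay) against the clamp multiple would be the quickest visual check of whether an exponential form is a reasonable description.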

Original post by Truong Xuan Khanh, Doan Hoang Viet, Luu Duc Trung, Phan Thanh Duc

"arXiv:2606.13753v1 Announce Type: new Abstract: Grokking is the delayed onset of generalization in neural networks, arising long after they fit the training data. Whether the weight norm causes this delay is disputed: some studies report a critical norm at the transition, others…"
