Logit Scale, Not Weight Norm, Controls Grokking in AI Models

Truong Xuan Khanh · June 18, 2026

Summary

This research investigates "grokking" in AI models and finds that the logit scale, acting through softmax saturation, is the primary variable controlling the delay before generalization, not the weight norm itself. The weight norm acts as an upstream handle, but under cross-entropy loss its effect is mediated through the logit scale.

Grokking is the phenomenon in which a neural network first memorizes its training data and then, after prolonged further training, suddenly generalizes to unseen data. It has often been linked to the weight norm, with smaller norms typically leading to earlier generalization. This study asks what the norm actually controls. Holding the weight norm fixed by clamping and varying only an output temperature, the researchers could move the grokking delay across its entire range under cross-entropy loss. Matching the effective logit scale back to its baseline value recovered a significant portion of the delay. Across norms and temperatures, the delay collapsed primarily onto the logit scale, with the weight norm contributing only a minor additional effect. The effect is loss-dependent: under mean-squared error the logit scale remains fixed, and the weight norm operates through a different pathway. Supported by additional controls, the findings indicate that the logit scale and the resulting softmax saturation are the direct variables controlling grokking, while the weight norm is an indirect, upstream control.
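
To make "softmax saturation" concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's experimental setup: the example logits, the `scale` values, and the helper name `ce_grad_norm` are all assumptions. It shows that when a correctly classified example has its logits multiplied by a larger scale (equivalently, divided by a smaller temperature under the common `logits / T` convention), the cross-entropy gradient with respect to the scaled logits shrinks toward zero.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_grad_norm(logits, target, scale):
    """L2 norm of the cross-entropy gradient w.r.t. the scaled logits.

    For cross-entropy on softmax outputs this gradient is
    softmax(scale * z) - onehot(target).
    """
    probs = softmax([scale * z for z in logits])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))

# Hypothetical logits for one training example whose correct class is 0.
logits = [2.0, 0.5, -1.0]

for scale in (0.5, 1.0, 4.0, 16.0):
    top_prob = max(softmax([scale * z for z in logits]))
    print(f"scale={scale:5.1f}  max prob={top_prob:.6f}  "
          f"grad norm={ce_grad_norm(logits, 0, scale):.6f}")
```

Note that the gradient with respect to the unscaled logits carries an extra factor of `scale`; the sketch reports the gradient at the softmax input to isolate the saturation effect.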

Why it matters

Understanding the true mechanisms behind grokking is crucial for optimizing AI training processes, especially for achieving faster and more reliable generalization. Professionals can use this insight to better diagnose training dynamics and potentially design more efficient learning algorithms.

How to implement this in your domain

  1. Analyze training dynamics of models exhibiting grokking, focusing on logit scale and softmax saturation.
  2. Experiment with output temperature adjustments to control grokking delay, rather than solely relying on weight norm regularization.
  3. Develop diagnostic tools to monitor logit scale during training to predict and manage generalization behavior.
  4. Consider the implications of loss function choice on the relationship between weight norm, logit scale, and grokking.
  5. Apply these insights to fine-tune hyperparameters for faster generalization in deep learning models.
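
The monitoring step above could start from something like the following sketch. The function name `logit_diagnostics`, the two statistics it reports, and the `logits / temperature` convention are assumptions chosen for illustration; the source does not specify which logit-scale measure the study uses. It takes a batch of raw logits and returns their root-mean-square magnitude after temperature scaling, plus the mean top softmax probability as a rough saturation indicator.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_diagnostics(batch_logits, temperature=1.0):
    """Summarize logit scale and softmax saturation for one batch.

    batch_logits: list of per-example logit lists (raw model outputs).
    temperature:  assumed convention is logits / temperature.
    """
    scaled = [[z / temperature for z in row] for row in batch_logits]
    count = sum(len(row) for row in scaled)
    logit_rms = math.sqrt(sum(z * z for row in scaled for z in row) / count)
    mean_max_prob = sum(max(softmax(row)) for row in scaled) / len(scaled)
    return {"logit_rms": logit_rms, "mean_max_prob": mean_max_prob}

# Hypothetical batch of two examples with three classes each.
batch = [[1.0, -1.0, 0.0], [0.5, -0.5, 0.2]]
for temperature in (2.0, 1.0, 0.25):
    print(temperature, logit_diagnostics(batch, temperature))
```

Logging these two numbers every few hundred steps alongside train and validation accuracy would show whether the onset of generalization tracks the logit scale in a given setup.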

Who benefits

AI Engineering, Machine Learning Research, Software Development, Data Science

Key takeaways

  • Grokking's delay is primarily controlled by the logit scale and softmax saturation.
  • The weight norm acts as an indirect, upstream handle on the logit scale under cross-entropy.
  • This relationship is loss-dependent, differing under mean-squared error.
  • Understanding this mechanism can lead to more efficient and predictable AI model training.
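
One way to build intuition for the loss-dependence takeaway is the toy comparison below. It is an assumption-laden sketch, not the paper's argument: the one-hot regression target for mean-squared error, the example logit direction, and the helper names are all hypothetical. For a correctly classified example, cross-entropy keeps decreasing as the logits are scaled up, so it rewards ever-larger logit scale, whereas mean-squared error against a one-hot target is minimized at a finite scale.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def mse_to_onehot(logits, target):
    """Mean-squared error between raw logits and a one-hot target."""
    return sum(
        (z - (1.0 if i == target else 0.0)) ** 2
        for i, z in enumerate(logits)
    ) / len(logits)

# Hypothetical logit direction for an example whose correct class is 0.
direction = [1.0, 0.0, 0.0]

for scale in (0.5, 1.0, 2.0, 8.0):
    scaled = [scale * z for z in direction]
    print(f"scale={scale:4.1f}  CE={cross_entropy(scaled, 0):.4f}  "
          f"MSE={mse_to_onehot(scaled, 0):.4f}")
```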

Original post by Truong Xuan Khanh

"arXiv:2606.18465v1 Announce Type: new Abstract: Grokking, the delayed jump from memorization to generalization, is usually tied to the weight norm: a smaller norm generalizes sooner. We ask what the norm actually controls. Holding the weight norm fixed by clamping and varying onl…"

Originally posted by Truong Xuan Khanh on X.
