Noise Explains Grokking Phenomenon in Deep Neural Networks

Ibrahim Talha Ersoy, Karoline Wiesner · June 17, 2026

Summary

Researchers propose that the "grokking" phenomenon in deep neural networks, where generalization abruptly appears after prolonged overfitting, is explained by noise-driven escape from metastable phases. They demonstrate that SGD noise can drive models across energy barriers separating low-accuracy states from generalized states, consistent with hysteresis in L2 phase transitions.

Deep neural networks (DNNs) exhibit a peculiar phenomenon known as "grokking," in which a model suddenly achieves strong generalization long after it appeared to have overfit the training data. Why generalization arrives so late has been an open question in deep learning. New research suggests that grokking can be explained by the network's escape from "metastable phases," driven by noise during training.

The study shows that DNNs undergo first-order phase transitions as the L2 regularization strength is varied, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, multiple features are learnable, but the network can get trapped in low-accuracy metastable states separated from better states by energy barriers. For linear DNNs, the researchers demonstrate that grokking is consistent with hysteresis in these L2 phase transitions.

Stochastic gradient descent (SGD) noise can push the model across these energy barriers, moving it from a metastable, low-accuracy state to a generalized state. The escape times follow Arrhenius scaling, meaning they grow roughly exponentially with the barrier height relative to the noise level. By deliberately trapping models, the authors reproduced grokking-like delayed convergence. This mechanism likely extends to general nonlinear DNNs, suggesting that task complexity increases the number of metastable states and the potential for hysteresis.
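The noise-driven escape described above can be illustrated with a toy model. The sketch below is not the authors' setup: it assumes a one-dimensional double-well potential V(x) = x^4/4 - x^2/2 (barrier height 0.25), treats the left well as a stand-in for the metastable state, and uses a `temperature` parameter as a stand-in for SGD noise strength. The function name `mean_escape_time` and all parameter values are illustrative choices.

```python
import numpy as np


def mean_escape_time(temperature, n_walkers=200, dt=0.01,
                     max_steps=100_000, seed=0):
    """Mean first time noisy gradient descent leaves the left well.

    Toy assumption: V(x) = x**4 / 4 - x**2 / 2, so V'(x) = x**3 - x,
    with minima at x = -1 and x = +1 and a barrier of height 0.25 at x = 0.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_walkers, -1.0)             # start every walker in the left well
    escape_step = np.full(n_walkers, max_steps, dtype=np.int64)
    active = np.ones(n_walkers, dtype=bool)  # walkers that have not escaped yet
    noise_scale = np.sqrt(2.0 * temperature * dt)

    for step in range(1, max_steps + 1):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break
        xa = x[idx]
        # Euler-Maruyama update: gradient step plus Gaussian noise.
        xa = xa - (xa**3 - xa) * dt + noise_scale * rng.standard_normal(idx.size)
        x[idx] = xa
        escaped = xa > 0.5                   # past the barrier top at x = 0
        escape_step[idx[escaped]] = step
        active[idx[escaped]] = False

    # Walkers that never escape are counted at max_steps (a censored estimate).
    return escape_step.mean() * dt


if __name__ == "__main__":
    temperatures = np.array([0.08, 0.12, 0.2])
    times = np.array([mean_escape_time(T) for T in temperatures])
    # Arrhenius scaling predicts log(time) is roughly linear in 1 / temperature,
    # with a slope close to the barrier height.
    slope, _ = np.polyfit(1.0 / temperatures, np.log(times), 1)
    for T, tau in zip(temperatures, times):
        print(f"T={T:.2f}  mean escape time={tau:.1f}")
    print(f"slope of log(time) vs 1/T: {slope:.3f}")
```

Lower noise should give markedly longer escape times in this toy model, which is the qualitative signature of Arrhenius scaling. How closely real SGD noise behaves like the isotropic Gaussian noise assumed here is a separate question that the sketch does not address.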

Why it matters

Understanding grokking provides fundamental insights into how deep neural networks learn and generalize, potentially leading to more efficient training schemes and better control over model behavior, especially in complex tasks where generalization is critical.

How to implement this in your domain

  1. Analyze training dynamics for signs of grokking or metastable states in deep learning models.
  2. Experiment with controlled noise injection or regularization schedules to potentially accelerate escape from metastable phases.
  3. Develop diagnostic tools to identify and visualize energy landscapes and phase transitions in neural network training.
  4. Consider the implications of task complexity on the potential for grokking and design training strategies accordingly.
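As a starting point for the first step, one simple diagnostic is the gap between the training step at which training accuracy first crosses a threshold and the step at which test accuracy does. The sketch below assumes you already log accuracy curves at known steps; the helper names `first_crossing` and `grokking_delay`, the 0.95 threshold, and the synthetic curves are all hypothetical choices for illustration, not part of the original work.

```python
import numpy as np


def first_crossing(steps, values, threshold):
    """Return the first logged step where values >= threshold, or None."""
    steps = np.asarray(steps)
    values = np.asarray(values)
    hits = np.flatnonzero(values >= threshold)
    return None if hits.size == 0 else int(steps[hits[0]])


def grokking_delay(steps, train_acc, test_acc, threshold=0.95):
    """Steps between train and test accuracy first reaching the threshold.

    Returns None if either curve never reaches the threshold.
    """
    train_step = first_crossing(steps, train_acc, threshold)
    test_step = first_crossing(steps, test_acc, threshold)
    if train_step is None or test_step is None:
        return None
    return test_step - train_step


if __name__ == "__main__":
    # Synthetic curves: training accuracy saturates early, while test
    # accuracy jumps much later, mimicking a grokking-like run.
    steps = np.arange(0, 10_000, 100)
    train_acc = 1.0 - np.exp(-steps / 300.0)
    test_acc = 1.0 / (1.0 + np.exp(-(steps - 6000) / 200.0))
    print("delay in steps:", grokking_delay(steps, train_acc, test_acc))
```

A large delay flags a run worth inspecting further. On its own it does not establish that the model was sitting in a metastable state, so it is best read alongside other signals such as loss plateaus or weight-norm trajectories.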

Who benefits

AI Development, Machine Learning Research, Software Development, Data Science

Key takeaways

  • Grokking in DNNs is explained by noise-driven escape from metastable phases.
  • DNNs exhibit first-order phase transitions related to L2 regularization and learnable features.
  • SGD noise can drive models across energy barriers from low-accuracy to generalized states.
  • This mechanism suggests routes toward more efficient learning schemes by understanding and controlling hysteresis.

Original post by Ibrahim Talha Ersoy, Karoline Wiesner

"arXiv:2606.17120v1 Announce Type: new Abstract: Deep neural networks (DNNs) exhibit first order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all f…"
