LLM Learning Rate Scaling is Nonlinear, Effective Rate Key

Zaiwen Yang, Huaqing Zhang, Jing Xu, Jingzhao Zhang · June 30, 2026

Summary

The optimal learning rate for large language models does not scale log-linearly with model size and data: it curves upward at larger scales. The nonlinearity largely disappears when practitioners work with the "effective learning rate" and extrapolate along data scale, which makes learning-rate transfer from small runs more accurate and cheaper.

A new study challenges the common assumption that optimal learning rates for training large language models (LLMs) scale log-linearly with model size and data volume. Empirical analysis of GPT-2 style models, ranging from 22 million to 707 million parameters trained on up to 100 billion tokens, demonstrated that the optimal learning rate exhibits an upward curvature at larger scales, leading to inaccuracies when extrapolating from smaller training runs. The researchers found that this nonlinearity largely disappears when the focus shifts from the nominal learning rate to the "effective learning rate," which represents the step size in normalized weight space. Furthermore, extrapolating based on the data scale (D) rather than model size (N) proved to be more accurate. The paper explains this nonlinearity by observing that weight-norm convergence to equilibrium is slower with smaller optimal learning rates, necessitating larger step sizes to shorten the transient training phase. Experiments with AdamH, an optimizer that directly manages the effective learning rate, further supported these findings, providing a clearer path for more efficient and predictable LLM training.
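
To make the "step size in normalized weight space" idea concrete, here is a minimal Python sketch. It assumes the effective learning rate can be approximated by the relative update size, ||Δw|| / ||w||; that proxy, the function name `relative_step_size`, and the random matrices are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def relative_step_size(weights_before, weights_after):
    """Return ||w_after - w_before|| / ||w_before||.

    Assumption: this ratio is used here as a proxy for the effective
    learning rate (step size relative to the current weight norm).
    """
    w0 = np.asarray(weights_before, dtype=np.float64)
    w1 = np.asarray(weights_after, dtype=np.float64)
    norm = np.linalg.norm(w0)
    if norm == 0.0:
        raise ValueError("weight norm is zero; relative step size is undefined")
    return float(np.linalg.norm(w1 - w0) / norm)

# Illustrative random data: the same raw update applied to two weight
# matrices that point in the same direction but differ 10x in norm.
rng = np.random.default_rng(0)
direction = rng.standard_normal((64, 64))
update = 1e-3 * rng.standard_normal((64, 64))

small_norm_step = relative_step_size(direction, direction + update)
large_norm_step = relative_step_size(10.0 * direction, 10.0 * direction + update)
print(small_norm_step, large_norm_step)
```

The same nominal update moves the larger-norm weights proportionally less, which is why two runs with identical nominal learning rates can behave differently once their weight norms diverge.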

Why it matters

For AI engineers and researchers, this work offers a more accurate way to extrapolate optimal learning rates from smaller runs. Better extrapolation can reduce the compute spent on learning-rate sweeps at target scale and lowers the risk of training a large model with a mis-set learning rate.

How to implement this in your domain

  1. Adopt "effective learning rate" as a primary metric when tuning and scaling LLM training processes.
  2. Prioritize data scale (D) for extrapolating optimal learning rates rather than solely relying on model size (N).
  3. Experiment with optimizers like AdamH that directly control effective learning rates to improve training stability and efficiency.
  4. Re-evaluate existing learning rate schedules and scaling laws in your LLM training pipelines based on these findings.
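
One way to act on the extrapolation step above is to check small-scale sweep results for curvature before trusting a straight-line fit in log-log space. The sketch below is a hedged illustration only: the helper names are hypothetical, the quadratic check is a generic diagnostic rather than the paper's procedure, and the learning-rate values are SYNTHETIC placeholders generated from a made-up formula, not measurements.

```python
import numpy as np

def fit_log_log(scales, optimal_lrs, degree):
    """Least-squares polynomial fit of log10(lr) against log10(scale)."""
    x = np.log10(np.asarray(scales, dtype=np.float64))
    y = np.log10(np.asarray(optimal_lrs, dtype=np.float64))
    return np.polyfit(x, y, degree)

def predict_lr(coeffs, scale):
    """Evaluate a fitted log-log polynomial at a new scale."""
    return float(10.0 ** np.polyval(coeffs, np.log10(scale)))

# SYNTHETIC placeholder sweep: token counts D and "optimal" learning rates
# produced by a made-up convex curve in log-log space (not real results).
tokens = np.array([1e9, 2e9, 5e9, 1e10, 2e10])
offset = np.log10(tokens) - 9.0
synthetic_lrs = 10.0 ** (-2.0 - 0.3 * offset + 0.05 * offset ** 2)

linear_fit = fit_log_log(tokens, synthetic_lrs, degree=1)
quadratic_fit = fit_log_log(tokens, synthetic_lrs, degree=2)

# A positive leading coefficient signals upward curvature in log-log space.
curvature = quadratic_fit[0]

target_tokens = 1e11  # hypothetical target data scale
lr_from_line = predict_lr(linear_fit, target_tokens)
lr_from_curve = predict_lr(quadratic_fit, target_tokens)
print(curvature, lr_from_line, lr_from_curve)
```

When the curvature term is clearly positive, a straight-line extrapolation will under-predict the learning rate at the target scale. In that case, repeating the fit on an effective-learning-rate measurement instead of the nominal rate is the direction the findings suggest.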

Who benefits

AI Research, Cloud Computing, Software Development, Data Centers

Key takeaways

  • Optimal LLM learning rates deviate from a log-linear fit, curving upward at larger scales.
  • "Effective learning rate" and data scale extrapolation improve accuracy.
  • Nonlinearity is linked to weight-norm convergence speed.
  • These findings can lead to more efficient and cost-effective LLM training.
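
The link between learning rate and weight-norm convergence speed can be illustrated with a toy recurrence. This is a simplified model built on stated assumptions, not the paper's analysis: each step shrinks the weights by a decoupled weight-decay factor (1 - lr * wd) and then adds an update of fixed size lr * update_scale orthogonal to the weights, so the squared norm follows s <- (1 - lr * wd)^2 * s + (lr * update_scale)^2. The function name and all parameter values are hypothetical.

```python
def steps_to_equilibrium(lr, weight_decay, update_scale=1.0,
                         init_sq_norm=1.0, tol=0.01, max_steps=10_000_000):
    """Count steps until the squared weight norm is within a relative
    tolerance `tol` of the fixed point of the toy recurrence
        s <- (1 - lr * wd)^2 * s + (lr * update_scale)^2.
    """
    decay = (1.0 - lr * weight_decay) ** 2
    gain = (lr * update_scale) ** 2
    equilibrium = gain / (1.0 - decay)  # fixed point of the recurrence
    s = init_sq_norm
    for step in range(max_steps):
        if abs(s - equilibrium) <= tol * equilibrium:
            return step
        s = decay * s + gain
    raise RuntimeError("did not reach equilibrium within max_steps")

# Hypothetical settings: same weight decay, learning rates 10x apart.
fast = steps_to_equilibrium(lr=1e-2, weight_decay=0.1)
slow = steps_to_equilibrium(lr=1e-3, weight_decay=0.1)
print(fast, slow)
```

In this toy model the deviation from equilibrium decays by a factor of (1 - lr * wd)^2 per step, so the transient lasts on the order of 1 / (lr * wd) steps and grows as the learning rate shrinks.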

Original post by Zaiwen Yang, Huaqing Zhang, Jing Xu, Jingzhao Zhang

"arXiv:2606.29158v1 Announce Type: new Abstract: Learning-rate transfer can reduce the cost of training large language models: instead of sweeping learning rates at target scale, practitioners extrapolate from smaller runs. Existing approaches often assume that the optimal learnin…"
