New Scaling Law Optimizes LLM Token Allocation

Fabian Schaipp · July 3, 2026

Summary

This paper proposes a "three-term" scaling law that explicitly accounts for model size, training steps, and batch size, accurately recovering optimal batch size scaling. It allows robust fitting with fewer training runs and derives scaling laws for suboptimal batch sizes, matching previous empirical findings.

Training large language models (LLMs) efficiently means balancing model size, the amount of training data, and compute. This research introduces a "three-term" scaling law that accounts for model size and training data, but explicitly splits the data term into its two factors: the number of training steps and the batch size. The result is a more granular picture of how these factors interact to influence model performance. Fitted to a large set of training runs, the law accurately predicts how the optimal batch size scales.

A key advantage of the three-term law is robustness: it can be fitted reliably with significantly fewer training runs, including runs with suboptimal batch sizes, which makes it cheaper to use in research and development. The law can also be used to derive scaling relationships for the case where the batch size is not optimal. These derived relationships agree with previous empirical findings on the "critical batch size," which marks a transition point in training dynamics. Overall, the work offers a more principled and efficient way to allocate training tokens between steps and batch size.
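To make the idea concrete, here is a short worked derivation for one simple functional form. The summary does not state the paper's actual parameterization, so the additive form below, its symbols (E, A, C, G, α, β, γ), and the convention that batch size B is measured in tokens (so that the token budget is D = S·B) are illustrative assumptions only.

```latex
% Assumed illustrative form (NOT taken from the paper):
% N = model size, S = training steps, B = batch size in tokens.
\begin{align*}
L(N, S, B) &= E + A N^{-\alpha} + C S^{-\beta} + G B^{-\gamma} \\
\intertext{Fix the token budget $D = S B$ and substitute $S = D/B$:}
L(N, D/B, B) &= E + A N^{-\alpha} + C D^{-\beta} B^{\beta} + G B^{-\gamma} \\
\intertext{Setting $\partial L / \partial B = 0$ gives}
\beta C D^{-\beta} B^{\beta - 1} &= \gamma G B^{-\gamma - 1} \\
B^{*}(D) &= \left( \frac{\gamma G}{\beta C} \right)^{\frac{1}{\beta + \gamma}} D^{\frac{\beta}{\beta + \gamma}}
\end{align*}
```

Under this assumed form, the optimal batch size grows as a power of the token budget. This is one example of what an "optimal batch size scaling" relationship can look like; the paper's own functional form and fitted exponents may differ.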

Why it matters

For professionals involved in training large AI models, particularly LLMs, this research provides a more precise and efficient framework for resource allocation. Understanding these scaling laws can lead to faster training, better model performance, and significant cost savings by optimizing batch size and training steps.

How to implement this in your domain

  1. Review current LLM training strategies for token allocation and batch size optimization.
  2. Investigate applying the proposed "three-term" scaling law to predict optimal training configurations.
  3. Experiment with dynamic batch sizing strategies informed by the new scaling law to improve training efficiency.
  4. Utilize the law to derive scaling predictions for suboptimal batch sizes, guiding resource allocation in constrained environments.
  5. Educate engineering teams on the implications of this scaling law for future LLM development and deployment.
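As a starting point for the steps on predicting optimal configurations and on suboptimal batch sizes, the following is a minimal Python sketch. It assumes a hypothetical additive law, loss = E + A·N^(-α) + C·S^(-β) + G·B^(-γ), with batch size B measured in tokens and a fixed token budget D = S·B. The functional form, every coefficient value, and all function names are assumptions made for illustration; none are taken from the paper.

```python
# Hypothetical coefficients for an ASSUMED additive three-term form:
#   loss = E + A * N**-ALPHA + C * S**-BETA + G * B**-GAMMA
# Neither the form nor the values are taken from the paper.
E, A, ALPHA = 1.7, 400.0, 0.34
C, BETA = 50.0, 0.5    # training-steps term
G, GAMMA = 20.0, 0.5   # batch-size term


def loss(n_params, steps, batch_tokens):
    """Assumed loss as a function of model size, steps, and batch size (in tokens)."""
    return (E + A * n_params ** -ALPHA
            + C * steps ** -BETA
            + G * batch_tokens ** -GAMMA)


def loss_at_budget(n_params, token_budget, batch_tokens):
    """Loss when a fixed token budget is spent at a given batch size."""
    steps = token_budget / batch_tokens
    return loss(n_params, steps, batch_tokens)


def optimal_batch_tokens(token_budget):
    """Closed-form minimizer of loss_at_budget over batch size for the assumed form."""
    ratio = (GAMMA * G) / (BETA * C)
    exponent = BETA / (BETA + GAMMA)
    return ratio ** (1.0 / (BETA + GAMMA)) * token_budget ** exponent


def suboptimal_penalty(n_params, token_budget, batch_tokens):
    """Extra loss incurred by using batch_tokens instead of the optimum."""
    best = loss_at_budget(n_params, token_budget, optimal_batch_tokens(token_budget))
    return loss_at_budget(n_params, token_budget, batch_tokens) - best


if __name__ == "__main__":
    n_params, token_budget = 1e9, 1e10
    b_star = optimal_batch_tokens(token_budget)
    print(f"closed-form optimal batch size: {b_star:.0f} tokens")
    # Penalty for a few batch sizes on either side of the optimum.
    for b in (2 ** k for k in range(12, 21, 2)):
        print(b, round(suboptimal_penalty(n_params, token_budget, b), 4))
```

In practice the coefficients would come from fitting the law to your own training runs, and the closed-form optimum would need to be re-derived for whatever functional form the paper actually uses. The penalty function shows how a fitted law could be used to estimate the cost of a hardware-constrained batch size before committing compute.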

Who benefits

AI Infrastructure · Cloud Computing · Software Development · Research & Development · Data Centers

Key takeaways

  • A new "three-term" scaling law models LLM training in terms of model size, training steps, and batch size.
  • It accurately predicts optimal batch size scaling and can be fitted robustly with fewer training runs.
  • The law yields scaling relationships for suboptimal batch sizes that match previous empirical findings on the critical batch size.
  • This offers a more efficient and principled approach to token allocation in LLM training.

Original post by Fabian Schaipp

"arXiv:2607.01487v1 Announce Type: new Abstract: We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called three-term law). Fitting the proposed law on a large set of training runs…"

Originally posted by Fabian Schaipp on X
