Small Initialization Improves LLM Pretraining and Reasoning

Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu · June 17, 2026

Summary

Research demonstrates that the scale of parameter initialization is a crucial factor in large language model training and capacity, with smaller initialization consistently enhancing pretraining and significantly boosting performance on reasoning tasks. This cost-free intervention drives a distinct developmental trajectory, leading to richer representations and improved context-constrained predictions.

Progress in large language models (LLMs) is usually attributed to scale, data volume, and architecture. This study argues that the scale of parameter initialization is also a fundamental determinant of both the training process and the final capacity of an LLM. Reducing the initialization scale consistently improves pretraining, and the largest gains appear on tasks that demand complex reasoning.

The researchers identified two common empirical settings that inadvertently limit the advantages of small initialization, and showed that adjusting these settings restores its positive scaling effects. They also pinpointed an optimal initialization point that balances reasoning capability against overall training efficiency.

Mechanistically, a smaller initialization scale appears to steer the model along a distinct developmental path: parameters first condense into simpler structures and then evolve into richer, more complex representations. The study presents this as support for the idea that compression is intrinsically linked to intelligence. Token-level analysis further indicates that the gains are concentrated on challenging, context-dependent predictions rather than spread uniformly across all tokens.

These findings motivate a simple "gamma-initialization" rule: make the initialization range an explicit tunable parameter and default to a smaller initialization. The authors describe this as a nearly cost-free way to improve pretraining and strengthen reasoning across model sizes.
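The token-level claim can be checked on your own models with a small bucketing analysis. The sketch below is illustrative only and is not the paper's procedure: it assumes you already have per-token losses from a baseline model and a small-initialization model on the same tokens, and it uses the baseline loss as a stand-in for token difficulty. The function name `gain_by_difficulty` and its inputs are hypothetical.

```python
from statistics import mean


def gain_by_difficulty(baseline_loss, small_init_loss, n_buckets=4):
    """Mean loss reduction per difficulty bucket, easiest bucket first.

    Assumption: "difficulty" is approximated by the baseline model's
    per-token loss. Both inputs are hypothetical lists of equal length,
    aligned so that index i refers to the same token in each.
    """
    if len(baseline_loss) != len(small_init_loss):
        raise ValueError("loss lists must be aligned token by token")
    if not baseline_loss:
        raise ValueError("loss lists must not be empty")

    # Sort token indices from easiest (lowest baseline loss) to hardest.
    order = sorted(range(len(baseline_loss)), key=lambda i: baseline_loss[i])
    n = len(order)

    gains = []
    for b in range(n_buckets):
        idx = order[b * n // n_buckets:(b + 1) * n // n_buckets]
        if idx:  # skip empty buckets when there are fewer tokens than buckets
            gains.append(mean(baseline_loss[i] - small_init_loss[i] for i in idx))
    return gains
```

If the gains are concentrated on hard tokens, the later buckets should show clearly larger values than the earlier ones.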

Why it matters

For AI engineers and researchers, the finding points to a simple, nearly cost-free lever for improving LLM pretraining and reasoning performance: tuning the initialization scale. Because it requires no extra data, parameters, or architectural changes, it can be tested within existing training pipelines.

How to implement this in your domain

  1. Experiment with smaller parameter initialization scales when training new large language models.
  2. Review and adjust existing empirical settings in your training pipelines that might be restraining the benefits of small initialization.
  3. Implement the proposed "gamma-initialization" rule, exposing initialization range as a tunable hyperparameter.
  4. Prioritize evaluating models with different initialization scales on reasoning-intensive benchmarks.
  5. Share findings and best practices within your team to leverage this optimization for future LLM development.
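As a starting point for the steps above, here is a minimal sketch of exposing the initialization scale as a hyperparameter. It rests on one assumption that this summary does not confirm: that the standard deviation is set to `fan_in ** (-gamma)`, so a larger `gamma` gives a smaller initialization. The names `InitConfig`, `init_std`, `init_weight`, and `sweep` are hypothetical, and the code uses only the Python standard library rather than any particular training framework.

```python
import random
from dataclasses import dataclass


@dataclass
class InitConfig:
    # Hypothetical knob: larger gamma means a smaller initialization scale.
    # gamma = 0.5 corresponds to the familiar 1/sqrt(fan_in) scaling.
    gamma: float = 0.5


def init_std(fan_in: int, cfg: InitConfig) -> float:
    """Assumed rule (not taken from the paper): std = fan_in ** (-gamma)."""
    if fan_in <= 0:
        raise ValueError("fan_in must be a positive integer")
    return fan_in ** (-cfg.gamma)


def init_weight(fan_out: int, fan_in: int, cfg: InitConfig, seed: int = 0):
    """Sample a fan_out x fan_in matrix from N(0, std^2) as nested lists."""
    rng = random.Random(seed)
    std = init_std(fan_in, cfg)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sweep(fan_in: int, gammas):
    """Map each candidate gamma to the std it would produce, for a sweep."""
    return {g: init_std(fan_in, InitConfig(gamma=g)) for g in gammas}
```

Calling `sweep(4096, [0.5, 0.75, 1.0])` lists the standard deviations a sweep would cover for a layer with 4096 inputs. Each setting can then be trained and compared on reasoning-heavy benchmarks, as step 4 suggests.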

Who benefits

AI Development · Research & Development · Software Engineering · High-Performance Computing

Key takeaways

  • Parameter initialization scale is a critical, often overlooked, factor in LLM training and capacity.
  • Smaller initialization consistently improves pretraining and boosts reasoning performance.
  • This optimization is nearly cost-free and can be applied across various model scales.
  • Small initialization drives a developmental trajectory from low-complexity to richer representations.

Original post by Liangkai Hang, Junjie Yao, Zhiyu Li, Feiyu Xiong, Hongkang Yang, Zhi-Qin John Xu

"arXiv:2606.17945v1 Announce Type: new Abstract: Large language models provide a tractable system for asking how intelligence itself emerges, rather than only how LLMs can be engineered. Although progress is usually attributed to scale, data and architecture, we show that paramete…"

