PACE Optimizes Training for Iterate-Averaged LMs.

Kwok Chun Au, Adam Block · June 25, 2026

Summary

This paper introduces PACE, a lightweight wrapper around AdamW designed to improve iterate-averaged language models. It frames optimizer design as an optimal-control problem; the resulting method pulls the live weights toward their exponential moving average (EMA), and the authors report gains over both standard AdamW and EMA-evaluated AdamW.

Many modern language model (LM) training pipelines return not the final iterate but an averaged model, such as an exponential moving average (EMA) of the training iterates. This raises a fundamental question: how should training itself be optimized for the performance of that averaged model?

The paper answers by framing optimizer design for the iterate-average estimator as an optimal-control problem. Solving this problem in a continuous-time stochastic quadratic model yields a controller that minimizes the error of the returned average while penalizing the size of the intervention. A practical approximation of this controller is PACE, a lightweight wrapper for AdamW that pulls the live model weights toward their EMA using a clipped, per-coordinate control strength.

On the theory side, the authors prove that a stylized version of PACE converges efficiently and that, in quadratic settings, it can substantially improve the limiting squared error of the iterate-average estimator. Empirically, PACE outperforms both standard AdamW and EMA-evaluated AdamW in supervised fine-tuning of 1-2B-parameter LMs and in GPT-2 pretraining across a range of hyperparameters.
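As a rough illustration of the mechanism described above, the following Python sketch applies one update in which the live weights take a gradient step, are pulled toward their EMA with a clipped per-coordinate strength, and the EMA is then refreshed. It is not the authors' algorithm: plain SGD stands in for AdamW, and the function name `pull_toward_ema_step`, the `gains` and `max_gain` parameters, the order of operations, and all default values are assumptions made for illustration. The summary does not say how PACE computes its control strength, so the gains are simply passed in by the caller.

```python
def pull_toward_ema_step(weights, ema, grads, lr=0.1, beta=0.9,
                         gains=None, max_gain=0.5):
    """One toy update on flat lists of floats.

    Hypothetical sketch only: plain SGD replaces AdamW, and the
    per-coordinate `gains` are supplied by the caller rather than
    derived from the paper's controller.
    """
    if gains is None:
        gains = [0.1] * len(weights)  # assumed default, not from the paper

    new_weights, new_ema = [], []
    for w, m, g, k in zip(weights, ema, grads, gains):
        w = w - lr * g                     # base optimizer step (SGD stand-in)
        k = min(max(k, 0.0), max_gain)     # clip the control strength
        w = w + k * (m - w)                # pull the live weight toward its EMA
        m = beta * m + (1.0 - beta) * w    # refresh the EMA with the new weight
        new_weights.append(w)
        new_ema.append(m)
    return new_weights, new_ema


if __name__ == "__main__":
    w, m = pull_toward_ema_step([1.0, -2.0], [0.0, 0.0], [0.5, -0.5])
    print("weights:", w)
    print("ema:", m)
```

Setting every gain to zero recovers the unmodified base step followed by an ordinary EMA update, which makes the pull easy to switch off when comparing against a baseline.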

Why it matters

For AI engineers and researchers whose pipelines already return an averaged model, PACE offers a direct way to improve that returned model. Because it is a lightweight wrapper around AdamW, it can be adopted without architectural changes or significant computational overhead.

How to implement this in your domain

  1. Evaluate current LM training pipelines to determine whether iterate averaging is being used.
  2. Integrate the PACE optimizer wrapper into existing AdamW-based training setups.
  3. Experiment with PACE across different learning rates, decay schedules, and hyperparameters for fine-tuning and pretraining.
  4. Benchmark the performance of PACE-trained models against those trained with standard AdamW and EMA-evaluated AdamW.
  5. Consider adopting PACE for production-grade LM training to achieve better final model quality.
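For step 4, a toy harness along the following lines can help sanity-check the evaluation code before running real experiments. This sketch covers only the two baselines, comparing the final iterate with its EMA on a one-dimensional noisy quadratic trained with plain SGD. The function names `run_noisy_quadratic` and `benchmark`, the objective, and every hyperparameter are assumptions chosen for illustration; none of them come from the paper, and a PACE-trained run would be added as a third arm.

```python
import random


def run_noisy_quadratic(seed, steps=1000, lr=0.1, beta=0.99, noise_std=1.0):
    """Run SGD on f(w) = 0.5 * w**2 with noisy gradients.

    Returns (final_w, ema_w). Toy stand-in for a training run; the
    optimum is w = 0, so squared values below are squared errors.
    """
    rng = random.Random(seed)
    w = 1.0
    ema = w  # initialize the EMA at the starting weight
    for _ in range(steps):
        grad = w + rng.gauss(0.0, noise_std)  # true gradient plus noise
        w -= lr * grad
        ema = beta * ema + (1.0 - beta) * w
    return w, ema


def benchmark(seeds=range(100)):
    """Mean squared error of the final iterate and of its EMA across seeds."""
    final_err = 0.0
    ema_err = 0.0
    count = 0
    for seed in seeds:
        w, ema = run_noisy_quadratic(seed)
        final_err += w * w
        ema_err += ema * ema
        count += 1
    return {"final_iterate": final_err / count, "ema": ema_err / count}


if __name__ == "__main__":
    print(benchmark())
```

Averaging over many seeds matters here because a single noisy run can favor either estimator by chance; the same applies when comparing real fine-tuning or pretraining runs.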

Who benefits

AI/ML Research · Large Language Model Development · Software Development

Key takeaways

  • Optimizing for the iterate-averaged model, not just the final iterate, can significantly improve LM performance.
  • PACE is a new optimizer wrapper that pulls live weights towards their EMA.
  • It outperforms standard AdamW and EMA-evaluated AdamW in various LM training scenarios.
  • PACE offers a practical and lightweight approach to enhance model quality.

Original post by Kwok Chun Au, Adam Block

"arXiv:2606.25086v1 Announce Type: new Abstract: Many modern Language Model (LM) pipelines return an averaged model, such as an exponential moving average of the training iterates, rather than the final iterate itself. This raises a fundamental question: given that we will return…"

