New Optimizer Improves Language Model Training Efficiency and Performance

Kathan Shah · July 3, 2026

Summary

This research introduces Ember, a lightweight optimizer designed specifically for the embedding table and LM-head matrices of language models. By exploiting the distinct gradient geometry of these components, Ember uses far less VRAM than Adam and is reported to improve performance in finetuning, RL, and pretraining.

Language models learn continuous programs over discrete symbols, with the embedding table and LM-head acting as the read/write interface between the two. The paper shows that this interface has a gradient geometry distinct from that of dense hidden weights, and that the difference can be exploited for optimization. Building on this, the authors develop Ember, a lightweight optimizer for embedding and LM-head matrices. Its optimizer state takes O(V + D) memory versus Adam's O(2VD), where V is the vocabulary size and D the embedding dimension, which removes the need to shard token table optimizer states. The authors report that Ember scales well across batch sizes and parameter counts and improves performance in supervised finetuning, reinforcement learning, and pretraining. The work also suggests that token optimization trajectories are surprisingly simple, which challenges conventional views of neural network loss landscapes. An open-source distributed implementation, compatible with existing setups, is provided.
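To make the memory claim concrete, the short sketch below compares the two state sizes for a single V x D token matrix. It is only an illustration: the function name, the example values of V and D, and the assumption that an O(V + D) optimizer keeps about one value per row and one per column are not taken from the paper, and Ember's actual state layout may differ.

```python
def optimizer_state_bytes(vocab_size, hidden_dim, bytes_per_element=4):
    """Return (adam_bytes, factored_bytes) for one V x D token matrix."""
    # Adam stores first and second moment estimates, each shaped V x D.
    adam_elements = 2 * vocab_size * hidden_dim
    # Assumption: an O(V + D) optimizer stores about one value per row
    # and one per column; Ember's real state layout may differ.
    factored_elements = vocab_size + hidden_dim
    return (adam_elements * bytes_per_element,
            factored_elements * bytes_per_element)


if __name__ == "__main__":
    V, D = 128_000, 4_096  # hypothetical vocabulary size and model width
    adam, factored = optimizer_state_bytes(V, D)
    print(f"Adam state:     {adam / 2**30:.2f} GiB")
    print(f"O(V + D) state: {factored / 2**20:.2f} MiB")
```

With these hypothetical sizes and fp32 state, Adam needs roughly 3.9 GiB per token matrix, while a state of V + D values needs about 0.5 MiB, which is why sharding that state would no longer be necessary.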

Why it matters

AI engineers and researchers can achieve significant memory savings and potentially faster, more efficient training of large language models, making advanced models more accessible and cost-effective to develop and deploy.

How to implement this in your domain

  1. Review the Ember optimizer's implementation details and integrate it into existing Transformer training pipelines.
  2. Benchmark Ember against current optimizers like Adam for embedding and LM-head layers to quantify VRAM savings and performance gains.
  3. Explore applying Ember in resource-constrained environments or for training extremely large language models.
  4. Contribute to the open-source project to further develop and refine the optimizer.
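A first integration step for the list above is usually to separate the token-interface parameters from the rest of the model so each group can be given its own optimizer. The sketch below shows one way to do that by name. The helper `split_interface_params` and the name fragments `embed_tokens` and `lm_head` are assumptions for illustration (they follow a common naming convention, not Ember's documented API), so adjust them to match your model.

```python
def split_interface_params(param_names, interface_keys=("embed_tokens", "lm_head")):
    """Partition parameter names into token-interface and hidden groups."""
    interface, hidden = [], []
    for name in param_names:
        # A parameter joins the interface group when its name contains
        # one of the (assumed) key fragments.
        if any(key in name for key in interface_keys):
            interface.append(name)
        else:
            hidden.append(name)
    return interface, hidden


if __name__ == "__main__":
    # Hypothetical parameter names for a small decoder-only model.
    names = [
        "model.embed_tokens.weight",
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.up_proj.weight",
        "lm_head.weight",
    ]
    interface, hidden = split_interface_params(names)
    print("interface group:", interface)
    print("hidden group:   ", hidden)
```

The interface group would then go to the token-matrix optimizer under test and the hidden group to the existing optimizer, which also gives a clean baseline for the Adam comparison in step 2.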

Who benefits

AI/ML Development · Cloud Computing · Research & Academia · Software Development

Key takeaways

  • The embedding table and LM-head have unique gradient geometry exploitable for optimization.
  • Ember is a new lightweight optimizer that significantly reduces VRAM for these components.
  • It improves performance across finetuning, RL, and pretraining tasks.
  • Ember scales effectively and is compatible with existing distributed training setups.

Original post by Kathan Shah

"arXiv:2607.01455v1 Announce Type: new Abstract: Language models learn continuous programs over discrete symbols, with the embedding table and LM-head acting as the read/write interface between them. We show that this interface has gradient geometry distinct from dense hidden weig…"

Originally posted by Kathan Shah on X
