Gefen Optimizer Reduces AdamW Memory Footprint by 8x, Boosts Throughput

Nadav Benedek, Tomer Koren, Ohad Fried · June 15, 2026

Summary

Gefen is a memory-efficient optimizer that reduces AdamW's memory footprint by approximately 8x while matching its performance. It does so by automatically sharing second-moment estimates across parameter blocks and quantizing first moments, which frees memory for larger models or batch sizes and improves throughput in distributed training.

AdamW is a widely adopted optimizer in deep learning, but it stores a first-moment and a second-moment state for every parameter, adding roughly two parameter-sized buffers to training memory. This overhead can limit the model size or batch size that fits on a device, especially in resource-constrained environments.

Gefen is a memory-efficient alternative that reduces AdamW's memory footprint by approximately 8x, a saving of about 6.5 GiB per billion parameters, while matching AdamW's performance. It relies on two mechanisms: it automatically shares second-moment estimates across parameter blocks, and it quantizes the first moment using a learned codebook.

The method is theoretically motivated by the observation that large mixed Hessian entries constrain the ratios of squared gradients, which makes the affected parameters natural candidates for sharing second-moment statistics. Gefen infers the block structure from initial squared gradients, so it requires no architecture-specific metadata and introduces no new hyperparameters.

Across the reported experiments, Gefen consistently achieved the lowest peak optimizer memory among AdamW-like methods while matching AdamW's performance. In distributed training setups such as FSDP and DDP, the reduced footprint allows larger microbatches, which improves throughput and makes Gefen a practical drop-in replacement for training larger models or using bigger batch sizes.
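
The 8x and 6.5 GiB figures can be checked against each other with a little arithmetic. The sketch below assumes that both AdamW moment buffers are stored in fp32 (4 bytes per value), which the summary does not state explicitly; the function names are illustrative only.

```python
GIB = 1024 ** 3  # bytes per GiB


def adamw_state_bytes(num_params, bytes_per_value=4, num_buffers=2):
    """Bytes held by AdamW's two moment buffers (assumption: fp32 states)."""
    return num_params * bytes_per_value * num_buffers


def saved_gib(num_params, reduction=8.0):
    """GiB freed if the optimizer state shrinks by `reduction` (8x per the summary)."""
    full = adamw_state_bytes(num_params)
    return (full - full / reduction) / GIB


params = 1_000_000_000
print(f"AdamW state: {adamw_state_bytes(params) / GIB:.2f} GiB")  # 7.45 GiB
print(f"Saved at 8x: {saved_gib(params):.2f} GiB")                # 6.52 GiB
```

Under that assumption, one billion parameters carry about 7.45 GiB of AdamW state, and keeping one eighth of it frees about 6.5 GiB, which agrees with the number quoted above.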

Why it matters

Deep learning engineers and researchers can leverage Gefen to train larger models or use bigger batch sizes, especially in distributed environments, without incurring prohibitive memory costs. This directly translates to more efficient experimentation, faster training times, and the ability to push the boundaries of model scale.

How to implement this in your domain

  1. Replace AdamW with Gefen in deep learning training pipelines to reduce optimizer memory usage.
  2. Experiment with larger batch sizes or model architectures made possible by Gefen's memory efficiency.
  3. Integrate Gefen into distributed training frameworks (e.g., FSDP, DDP) to improve throughput.
  4. Benchmark training performance and memory consumption when switching from AdamW to Gefen.
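
For the benchmarking step, one simple measurement is the number of bytes held in the optimizer's state. The helper below is a minimal sketch that assumes the optimizer exposes a `state` mapping of parameter to named buffers, as `torch.optim.Optimizer` does, and that Gefen is packaged the same way (an assumption; the summary does not describe its API). The `FakeBuffer` and `FakeOptimizer` classes are hypothetical stand-ins so the example runs without PyTorch.

```python
def optimizer_state_bytes(optimizer):
    """Sum the bytes held by tensor-like entries in optimizer.state.

    Expects a `state` mapping of parameter -> {name: buffer}.
    Buffers are recognised by their numel() and element_size() methods,
    so plain Python values such as an integer step counter are skipped.
    """
    total = 0
    for buffers in optimizer.state.values():
        for value in buffers.values():
            if hasattr(value, "numel") and hasattr(value, "element_size"):
                total += value.numel() * value.element_size()
    return total


# Hypothetical stand-ins so the sketch runs without PyTorch.
class FakeBuffer:
    def __init__(self, n, itemsize):
        self._n, self._itemsize = n, itemsize

    def numel(self):
        return self._n

    def element_size(self):
        return self._itemsize


class FakeOptimizer:
    def __init__(self, state):
        self.state = state


# Two fp32 moment buffers for a 1000-element parameter, plus a step counter.
adamw_like = FakeOptimizer({"w": {"step": 1,
                                  "exp_avg": FakeBuffer(1000, 4),
                                  "exp_avg_sq": FakeBuffer(1000, 4)}})
print(optimizer_state_bytes(adamw_like))  # 8000
```

With a real PyTorch optimizer, the state is only populated after the first `optimizer.step()`, so the helper should be called after at least one training step. It counts resident state only; peak device memory during training is a separate measurement, for which `torch.cuda.max_memory_allocated()` is one option.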

Who benefits

AI Development, Cloud Computing, Research & Development, High-Performance Computing, Software Engineering

Key takeaways

  • Gefen reduces AdamW's memory footprint by ~8x while maintaining performance.
  • It shares second-moment estimates and quantizes first moments for efficiency.
  • Gefen enables training larger models or using larger batch sizes.
  • It significantly improves throughput in distributed training environments.
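
To make the idea of sharing second-moment estimates concrete, here is a toy AdamW-style step that keeps one second-moment scalar per block instead of one per parameter. It is not Gefen's algorithm: the block partition is fixed by hand here, whereas Gefen is described as inferring it from initial squared gradients, and the first-moment quantization is not shown. All names and hyperparameter defaults are illustrative assumptions.

```python
import math


def shared_v_adam_step(params, grads, m, v_block, blocks, step,
                       lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                       weight_decay=0.01):
    """One AdamW-style update with a single second moment per block.

    params, grads, m: lists of floats (m is the per-parameter first moment).
    v_block: one float per block, the shared second-moment estimate.
    blocks: list of index lists (hand-picked here, which is an assumption).
    """
    bc1 = 1 - beta1 ** step  # bias correction for the first moment
    bc2 = 1 - beta2 ** step  # bias correction for the second moment
    for b, idxs in enumerate(blocks):
        # Share one statistic: the mean squared gradient over the block.
        mean_sq = sum(grads[i] ** 2 for i in idxs) / len(idxs)
        v_block[b] = beta2 * v_block[b] + (1 - beta2) * mean_sq
        denom = math.sqrt(v_block[b] / bc2) + eps
        for i in idxs:
            m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
            params[i] -= lr * weight_decay * params[i]  # decoupled weight decay
            params[i] -= lr * (m[i] / bc1) / denom
    return params


params = [1.0, -2.0, 0.5, 3.0]
grads = [0.1, 0.2, -0.3, 0.4]
m = [0.0] * len(params)
blocks = [[0, 1], [2, 3]]
v_block = [0.0] * len(blocks)
shared_v_adam_step(params, grads, m, v_block, blocks, step=1)
print(len(v_block), "second-moment values for", len(params), "parameters")
```

In this toy setting the second-moment storage drops from one value per parameter to one value per block, which is where the memory saving from sharing would come from; how much is saved in practice depends on how the blocks are chosen.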

Original post by Nadav Benedek, Tomer Koren, Ohad Fried

"arXiv:2606.13894v1 Announce Type: new Abstract: AdamW is a default optimizer for modern deep learning, but its first and second moment states add roughly two parameter-sized buffers to training memory. We propose Gefen, a memory-efficient optimizer that automatically shares secon…"
