EnerInfer Optimizes On-Device LLM Inference Energy

Bohua Zou, Nian Liu, Binqi Sun, Matteo Mascherin, Debayan Roy, Yutao Liu, Yu Peng, Ning Jia, Haibo Chen · June 24, 2026

Summary

EnerInfer is a novel framework designed for energy-aware on-device LLM inference, jointly managing energy efficiency, throughput, and thermal comfort. It achieves this by predicting optimal NPU/DDR frequency settings for unseen LLMs and dynamically adjusting configurations to improve energy efficiency without sacrificing quality of experience.

Running Large Language Models (LLMs) directly on devices offers privacy and lower cost, but energy consumption and thermal management remain significant obstacles. Current systems focus mainly on maximizing decoding speed and often overlook the potential for energy savings. The paper observes that slightly lowering NPU and memory (DDR) frequencies can yield substantial gains in energy efficiency and reduce heat without compromising the user's quality of experience (QoE).

Finding the most energy-efficient configuration is hard, however, because it varies across models, inference engines, platforms, and runtime conditions. Commercial devices also lack detailed component-level power sensing, and shell temperature depends on several dynamic factors.

To address these issues, EnerInfer moves beyond per-model profiling and heavy reliance on sensors. It combines disaggregated, model-structure-aware prediction with ranking-driven online feedback. The framework predicts throughput and power for new LLMs across NPU/DDR frequency settings, selects efficient configurations that meet QoE requirements even under runtime interference, and uses lightweight thermal prediction to switch between energy-optimized and thermally constrained modes. Evaluations on real-world LLMs report energy efficiency improvements of up to 65% on phones, 12% on laptops, and 24% on development boards, all while maintaining QoE.
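The selection step described above (pick the most energy-efficient NPU/DDR setting that still meets a QoE throughput floor) can be illustrated with a minimal sketch. This is not EnerInfer's implementation: the `FreqConfig` type, the `select_config` function, the tokens-per-joule metric, and all numbers below are hypothetical and stand in for whatever the paper's predictors would actually produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FreqConfig:
    """One candidate NPU/DDR frequency pair with predicted metrics (all values hypothetical)."""
    npu_mhz: int
    ddr_mhz: int
    tokens_per_s: float  # predicted decoding throughput
    power_w: float       # predicted average power draw

    @property
    def tokens_per_joule(self) -> float:
        # Energy efficiency: tokens generated per joule consumed.
        return self.tokens_per_s / self.power_w


def select_config(candidates, min_tokens_per_s):
    """Return the most energy-efficient candidate that meets the QoE throughput floor.

    If no candidate meets the floor, fall back to the fastest one
    (an assumed policy, chosen here only to keep the sketch total).
    """
    feasible = [c for c in candidates if c.tokens_per_s >= min_tokens_per_s]
    if not feasible:
        return max(candidates, key=lambda c: c.tokens_per_s)
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Made-up predictions for four frequency settings.
candidates = [
    FreqConfig(npu_mhz=1000, ddr_mhz=3200, tokens_per_s=20.0, power_w=5.0),
    FreqConfig(npu_mhz=800, ddr_mhz=3200, tokens_per_s=18.0, power_w=3.6),
    FreqConfig(npu_mhz=600, ddr_mhz=2133, tokens_per_s=12.0, power_w=2.0),
    FreqConfig(npu_mhz=400, ddr_mhz=1600, tokens_per_s=7.0, power_w=1.4),
]

# With a floor of 10 tokens/s, the 600/2133 MHz setting is the most efficient feasible choice.
best = select_config(candidates, min_tokens_per_s=10.0)
print(best.npu_mhz, best.ddr_mhz)
```

Raising the throughput floor shrinks the feasible set, so the chosen setting moves toward higher frequencies; that is the trade-off between QoE and energy efficiency the summary describes.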

Why it matters

This framework enables more sustainable and practical deployment of LLMs on edge devices, extending battery life, reducing heat, and making on-device AI more viable for a wider range of applications.

How to implement this in your domain

  1. Adopt EnerInfer's principles for optimizing LLM inference on your edge devices.
  2. Implement model-structure-aware prediction to estimate energy consumption and throughput for new LLMs.
  3. Develop dynamic frequency scaling strategies based on predicted energy efficiency and thermal constraints.
  4. Integrate lightweight thermal prediction into your device management system for adaptive LLM inference.
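As one way to picture the last two steps, the sketch below switches between an energy-optimized and a thermally constrained mode based on a predicted shell temperature. The predictor here is a plain linear extrapolation over recent samples with a hysteresis band, chosen for illustration only; the summary does not describe EnerInfer's actual thermal model, and the class name, mode names, and thresholds are all assumptions.

```python
from collections import deque


class ThermalModeSwitch:
    """Toy mode switch driven by a linear extrapolation of shell temperature (hypothetical)."""

    def __init__(self, limit_c, horizon_s, hysteresis_c=1.0, window=5):
        self.limit_c = limit_c            # shell temperature limit in degrees Celsius
        self.horizon_s = horizon_s        # how far ahead to extrapolate, in seconds
        self.hysteresis_c = hysteresis_c  # band that prevents rapid mode flapping
        self.samples = deque(maxlen=window)  # recent (time_s, temp_c) readings
        self.mode = "energy_optimized"

    def predict(self):
        """Extrapolate the latest temperature using the slope across the sample window."""
        t1, c1 = self.samples[-1]
        t0, c0 = self.samples[0]
        if t1 <= t0:
            # Only one sample (or non-increasing timestamps): no usable trend yet.
            return c1
        slope = (c1 - c0) / (t1 - t0)
        return c1 + slope * self.horizon_s

    def update(self, time_s, temp_c):
        """Record a reading and return the mode to use next."""
        self.samples.append((time_s, temp_c))
        predicted = self.predict()
        if predicted >= self.limit_c:
            self.mode = "thermally_constrained"
        elif predicted <= self.limit_c - self.hysteresis_c:
            self.mode = "energy_optimized"
        # Inside the hysteresis band the previous mode is kept.
        return self.mode


switch = ThermalModeSwitch(limit_c=40.0, horizon_s=30.0)
first = switch.update(0.0, 36.0)    # no trend yet, well below the limit
second = switch.update(10.0, 38.0)  # rising 0.2 C/s, extrapolates to 44 C
third = switch.update(20.0, 37.0)   # trend flattens, extrapolates to 38.5 C
print(first, second, third)
```

In a real system the returned mode would select which set of frequency configurations is eligible, for example by restricting the thermally constrained mode to lower-power settings.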

Who benefits

Mobile Computing, Edge AI, Consumer Electronics, Automotive, IoT

Key takeaways

  • EnerInfer optimizes on-device LLM inference for energy efficiency, throughput, and thermal comfort.
  • It improves energy efficiency by up to 65% on phones, 12% on laptops, and 24% on development boards while maintaining QoE.
  • The framework uses model-structure-aware prediction and dynamic configuration adjustments.
  • EnerInfer addresses challenges like varying optimal settings and lack of detailed power sensing.

Original post by Bohua Zou, Nian Liu, Binqi Sun, Matteo Mascherin, Debayan Roy, Yutao Liu, Yu Peng, Ning Jia, Haibo Chen

"arXiv:2606.23001v1 Announce Type: cross Abstract: On-device LLM inference is increasingly attractive for privacy-preserving, reliable, and cost-effective deployment, yet its energy and thermal costs remain a critical bottleneck. Existing systems primarily optimize for decoding sp…"
