SharQ Boosts LLM Inference with FP4 Quantization

Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Huaqing Zheng, Xindian Ma, Peng Zhang · June 26, 2026


Summary

SharQ is a training-free inference method that combines activation sparsity and FP4 quantization for LLMs through an online sparse-dense decomposition. It reduces latency relative to FP16 and improves throughput relative to FP8, while recovering 43-63% of the accuracy gap between NVFP4 and FP16.

Modern AI accelerators increasingly support low-bit floating-point formats and semi-structured sparsity, but effectively combining these for Large Language Model (LLM) activation compression remains a significant challenge. Activations often contain input-dependent outliers that complicate FP4 quantization, and direct application of sparsity masks can lead to information loss. This research introduces SharQ, a training-free inference method designed to bridge this gap.

SharQ employs an online sparse-dense decomposition for each activation tensor. It generates an input-adaptive N:M mask to extract an outlier-dominated sparse backbone, which is then quantized to FP4. Crucially, a dense residual is defined relative to this quantized sparse backbone, compensating for both mask-induced activation loss and sparse-path quantization error. Both paths share a single FP4 weight payload with path-specific scale views, and a fused kernel integrates mask generation, residual construction, and layer normalization into one efficient operation. This method requires no calibration data, retraining, or model-specific tuning.

Evaluated on several Llama and Qwen models, SharQ recovers 43-63% of the accuracy gap between NVFP4 and FP16 across various tasks. On an RTX 5090, it delivers 2.2-2.4x latency reduction over FP16 and 1.2-1.4x throughput improvement over FP8 in language model serving.
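To make the decomposition concrete, here is a minimal NumPy sketch of the idea as described above. It is not the authors' fused kernel: the function names (`fake_quant_fp4`, `nm_mask`, `sparse_dense_decompose`), the 2:4 pattern, the magnitude-based mask rule, and the block size of 16 are all assumptions made for illustration, and FP4 is only simulated in floating point with an unquantized per-block scale.

```python
import numpy as np

# Magnitudes representable by FP4 in the E2M1 layout (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block=16):
    """Simulate block-scaled FP4: per block, map the max magnitude to 6,
    snap every value to the nearest grid point, then rescale.
    Assumption: the per-block scale stays in full precision here."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(x.shape)


def nm_mask(x, n=2, m=4):
    """Input-adaptive N:M mask: keep the n largest-magnitude entries
    in every group of m consecutive values (hypothetical selection rule)."""
    groups = np.abs(x).reshape(-1, m)
    keep = np.argsort(groups, axis=1)[:, -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(x.shape)


def sparse_dense_decompose(x, n=2, m=4):
    """Split x into a quantized sparse backbone and a quantized dense residual."""
    mask = nm_mask(x, n, m)
    sparse_q = fake_quant_fp4(np.where(mask, x, 0.0))
    # The residual is taken against the *quantized* backbone, so it carries
    # both the masked-out values and the backbone's quantization error.
    residual_q = fake_quant_fp4(x - sparse_q)
    return sparse_q, residual_q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 64))
    x[:, ::16] *= 30  # inject one outlier per 16-value block
    sparse_q, residual_q = sparse_dense_decompose(x)
    print("direct FP4 MSE:  ", np.mean((x - fake_quant_fp4(x)) ** 2))
    print("sparse+dense MSE:", np.mean((x - (sparse_q + residual_q)) ** 2))
```

In this toy setting the outliers set the scale of the sparse path, so the residual can be quantized with a much finer scale. A linear layer would then be approximated as `sparse_q @ W + residual_q @ W`, which is where sharing one FP4 weight payload across both paths becomes relevant.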

Why it matters

SharQ offers a practical way to accelerate LLM inference: it needs no calibration data, retraining, or model-specific tuning, so large models become cheaper to deploy and more viable for real-time applications.

How to implement this in your domain

  1. Evaluate current LLM inference pipelines for potential bottlenecks in activation processing.
  2. Investigate integrating SharQ or similar sparse-dense decomposition techniques for FP4 quantization.
  3. Benchmark the performance gains (latency, throughput) and accuracy trade-offs of low-bit quantization methods.
  4. Explore hardware accelerators that support low-bit floating-point formats and semi-structured sparsity.
  5. Train engineering teams on advanced quantization and sparsity techniques for LLM deployment.
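For the benchmarking step, a small timing harness is often enough to get a first comparison. The sketch below is a generic, assumed setup rather than the paper's evaluation protocol: `benchmark` and `speedup` are hypothetical helper names, and the callables you pass in would wrap your own baseline and quantized inference paths.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=20):
    """Time a zero-argument callable and return latency statistics in seconds.
    Note: for GPU work, fn should synchronize the device before returning
    (for example with torch.cuda.synchronize()), otherwise only the kernel
    launch is timed."""
    for _ in range(warmup):
        fn()  # warm caches, allocators, and any lazy initialization
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
        "iters": iters,
    }


def speedup(baseline, candidate):
    """Latency reduction factor of candidate relative to baseline (median)."""
    return baseline["median_s"] / candidate["median_s"]


if __name__ == "__main__":
    # Stand-in workloads; replace with real FP16 and low-bit inference calls.
    slow = benchmark(lambda: sum(i * i for i in range(200_000)))
    fast = benchmark(lambda: sum(i * i for i in range(100_000)))
    print(f"speedup: {speedup(slow, fast):.2f}x")
```

Reporting the median alongside a tail percentile helps separate steady-state latency from occasional stalls, which matters when comparing serving configurations.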

Who benefits

AI/ML Development · Cloud Computing · Edge AI · Software Development · Telecommunications

Key takeaways

  • SharQ combines activation sparsity and FP4 quantization for efficient LLM inference.
  • The method uses an online sparse-dense decomposition to handle activation outliers.
  • On an RTX 5090, it delivers 2.2-2.4x latency reduction over FP16 and 1.2-1.4x throughput improvement over FP8, without retraining.
  • SharQ recovers 43-63% of the accuracy gap between NVFP4 and FP16 on the evaluated Llama and Qwen models.
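The "accuracy gap recovered" figure reported for SharQ (43-63% of the gap between NVFP4 and FP16) is presumably computed in the usual way, as the fraction of the FP16-minus-FP4 difference that a method wins back. A small sketch of that arithmetic follows; the function name and the scores in the example are made up for illustration and are not results from the paper.

```python
def gap_recovered(acc_fp16, acc_fp4, acc_method):
    """Fraction of the FP16-vs-FP4 accuracy gap that a method recovers.
    0.0 means no better than plain FP4, 1.0 means it matches FP16."""
    gap = acc_fp16 - acc_fp4
    if gap <= 0:
        raise ValueError("FP16 accuracy must exceed the FP4 baseline")
    return (acc_method - acc_fp4) / gap


if __name__ == "__main__":
    # Hypothetical scores: FP16 = 70.0, plain FP4 = 66.0, method = 68.0.
    print(f"{gap_recovered(70.0, 66.0, 68.0):.0%} of the gap recovered")
```

Framing results this way normalizes across tasks with different baseline scores, but it can exaggerate small absolute differences when the FP16-to-FP4 gap itself is narrow.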

Original post by Haoqian Meng, Yilun Luo, Yafei Zhao, Wenyuan Liu, Huaqing Zheng, Xindian Ma, Peng Zhang

"arXiv:2606.26587v1 Announce Type: new Abstract: Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that…"

