Fidelity Metrics Fail to Predict Quantized LLM Performance in Critical Zone

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos · June 19, 2026

Summary

A study reveals that common fidelity metrics like per-token KL divergence (KLD) are poor predictors of benchmark quality for quantized Large Language Models (LLMs) in the "silent zone" near baseline performance. While KLD correlates strongly with performance across a wide range of quantization levels, this relationship collapses when models are close to high-precision performance, making it unreliable for fine-grained evaluation.

When quantized Large Language Models (LLMs) are deployed, fidelity metrics such as per-token KL divergence (KLD) against a high-precision reference are frequently used as a low-cost proxy for benchmark quality. The study tests this practice on quantized versions of Qwen3.6-35B-A3B and Devstral-Small-2-24B, evaluated against a suite of downstream benchmarks. Across the full range of quantization levels, KLD correlates strongly with benchmark scores. Within the "silent zone," the performance range near the high-precision baseline, the correlation weakens sharply and is no longer statistically significant. The breakdown holds across 14 measurement variations, including different KLD aggregations, perplexity formulations, and calibration corpora. Per-prompt analysis shows that KLD has only weak predictive power for identifying failures in code generation tasks and is ineffective as a cross-model router. The study attributes the collapse to KLD primarily measuring the volume of disagreement with the reference model rather than the direction of those disagreements, and direction is what matters in the silent zone, where performance differences are subtle.
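To make the "volume versus direction" point concrete, here is a minimal sketch of per-token KLD computed from logits, using only the Python standard library. The function names and the three-token toy distributions are hypothetical illustrations, not the paper's code or data. The toy case is built so that two quantized models sit at the same KLD from the reference while moving probability mass in opposite directions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, quant_logits):
    """KL(reference || quantized) at a single token position."""
    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, quant_logp))

def mean_kld(ref_seq, quant_seq):
    """Mean per-token KLD over a sequence (one common aggregation)."""
    values = [token_kld(r, q) for r, q in zip(ref_seq, quant_seq)]
    return sum(values) / len(values)

# Hypothetical toy vocabulary of 3 tokens; assume token 0 is the correct one.
ref = [math.log(p) for p in (0.4, 0.4, 0.2)]
quant_a = [math.log(p) for p in (0.5, 0.3, 0.2)]  # shifts mass toward token 0
quant_b = [math.log(p) for p in (0.3, 0.5, 0.2)]  # shifts mass toward token 1

kld_a = token_kld(ref, quant_a)
kld_b = token_kld(ref, quant_b)
top_a = max(range(3), key=lambda i: quant_a[i])
top_b = max(range(3), key=lambda i: quant_b[i])

print(f"KLD a={kld_a:.6f} b={kld_b:.6f} | top token a={top_a} b={top_b}")
```

Both toy models disagree with the reference by the same amount, yet one now prefers the assumed-correct token and the other prefers a different one. A scalar KLD cannot tell these cases apart, which is one way to read the study's explanation for the collapse.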

Why it matters

Teams deploying quantized LLMs need reliable metrics to judge model quality and choose between quantization strategies. This research identifies a flaw in commonly used fidelity metrics: they cannot reliably separate quantized models that are already close to baseline performance. Evaluation pipelines that rely on them for that decision should be revisited to avoid misleading conclusions.

How to implement this in your domain

  1. Re-evaluate your current LLM quantization evaluation pipelines, especially for models operating in the "silent zone" near baseline performance.
  2. Avoid relying solely on per-token KL divergence or similar fidelity metrics for fine-grained performance assessment of quantized LLMs.
  3. Prioritize direct benchmark evaluations over proxy metrics when selecting between high-performing quantized models.
  4. Investigate alternative or complementary evaluation methods that capture the "direction" of performance changes, not just the "volume" of deviation.
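One way to act on the first three steps is to check whether your proxy metric still ranks models inside the silent zone before trusting it there. The sketch below computes a Spearman rank correlation between KLD and benchmark score, first over a whole cohort and then over only the models near baseline. Everything in it is an assumption for illustration: the cohort values are synthetic, the baseline score of 70.0 and the 1.0-point tolerance are arbitrary, and the helper names are hypothetical. The paper's own zone definition and statistics may differ.

```python
import math

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def silent_zone(cohort, baseline, tolerance):
    """Keep (kld, score) pairs whose score is within tolerance of baseline."""
    return [(k, s) for k, s in cohort if abs(s - baseline) <= tolerance]

# Synthetic (mean KLD, benchmark score) pairs, NOT results from the paper.
cohort = [
    (0.001, 70.2), (0.002, 69.6), (0.004, 70.5), (0.006, 69.9), (0.010, 70.1),
    (0.050, 66.0), (0.120, 60.0), (0.300, 48.0), (0.800, 30.0),
]
zone = silent_zone(cohort, baseline=70.0, tolerance=1.0)

rho_full = spearman([k for k, _ in cohort], [s for _, s in cohort])
rho_zone = spearman([k for k, _ in zone], [s for _, s in zone])
print(f"full cohort rho={rho_full:.2f}, silent zone rho={rho_zone:.2f}")
```

The synthetic numbers are constructed so that the full-cohort correlation is strongly negative while the in-zone correlation is near zero, mirroring the pattern the study describes. On a real cohort, a weak in-zone correlation would be a signal to fall back on direct benchmark runs for the final selection, and a proper significance test would be needed given how few models typically fall in the zone.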

Who benefits

AI Development, Machine Learning Engineering, Cloud Computing, Software Engineering

Key takeaways

  • Common fidelity metrics like KLD are unreliable for evaluating quantized LLMs near baseline performance.
  • KLD primarily measures the volume of disagreement with the reference model, not its direction, so it can give misleading results.
  • Relying solely on KLD for fine-grained quantization evaluation can lead to suboptimal model deployment.
  • Direct benchmark evaluations are crucial for accurately assessing high-performing quantized LLMs.

Original post by Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

arXiv:2606.19558v1. Abstract: "Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41…"
