MosaicKV Compresses LLM KV Cache for Long Contexts

Sheng Qiang, Ruiwei Chen, Yinpeng Wu, Jinyu Gu, Zhichao Hua, Yubin Xia, Binyu Zang, Haibo Chen · July 2, 2026

Summary

MosaicKV is a dynamic two-dimensional KV cache compression system for serving extremely long-context LLMs, where the cache limits both memory and throughput. It dynamically selects compression strategies around the important KV cache elements, achieving up to 16x attention speedup and 7.3x higher throughput with a 1.76% average accuracy loss.

Serving large language models (LLMs) with extremely long contexts, often hundreds of thousands to millions of tokens, faces a significant bottleneck: the key-value (KV) cache. The cache grows linearly with context length, so it consumes large amounts of GPU memory, limits batch sizes, and reduces serving throughput. Previous compression techniques target only one dimension (sequence or channel) and offer limited gains at these context lengths, while combining them directly leads to substantial accuracy loss.

The paper introduces MosaicKV, a dynamic two-dimensional (2D) KV cache compression system built for extremely long-context LLM serving. MosaicKV addresses the accuracy problem by exploiting the non-uniform distribution of importance within the KV cache. Instead of applying a uniform compression pattern, it dynamically identifies the critical elements of each KV vector and selects a compression strategy at a fine-grained segment level. To contain the overhead of fine-grained compression, MosaicKV adds a compressed KV cache management mechanism that uses otherwise underutilized GPU and CPU resources to maintain the compressed caches and accelerate attention computation.

In evaluations on an H800 GPU with multiple LLMs, MosaicKV delivers up to 16x attention speedup, 4.8x lower decode latency, and 7.3x higher throughput, while reducing memory usage by 3x and incurring a 1.76% average accuracy loss on standard benchmarks.
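To make "two-dimensional" concrete, here is a minimal toy sketch in Python. It is not MosaicKV's algorithm, which the summary does not specify in detail. It only illustrates compressing along the sequence dimension (dropping low-importance tokens) and the channel dimension (keeping a different subset of entries for each surviving vector). The function names, the importance scores, and the magnitude-based channel selection are all assumptions made for illustration.

```python
# Toy illustration only: NOT MosaicKV's method. All names are hypothetical.

def compress_2d(keys, scores, keep_tokens, keep_channels):
    """Keep the `keep_tokens` highest-scoring key vectors (sequence dimension),
    then the `keep_channels` largest-magnitude entries of each survivor
    (channel dimension). Returns {token_index: {channel_index: value}}."""
    # Sequence dimension: rank token positions by the assumed importance score.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept_tokens = sorted(ranked[:keep_tokens])

    compressed = {}
    for t in kept_tokens:
        vec = keys[t]
        # Channel dimension: channels are chosen per vector, so different
        # tokens may retain different channels (a non-uniform pattern).
        top = sorted(range(len(vec)), key=lambda c: abs(vec[c]), reverse=True)
        compressed[t] = {c: vec[c] for c in sorted(top[:keep_channels])}
    return compressed


def approx_dot(query, sparse_vec):
    """Dot product of a dense query with one sparse (compressed) key."""
    return sum(query[c] * v for c, v in sparse_vec.items())


# Four tokens with four channels each; values are made up.
keys = [
    [0.9, 0.0, 0.1, -0.8],
    [0.1, 0.1, 0.1, 0.1],
    [0.0, 1.2, -0.1, 0.05],
    [0.3, -0.2, 0.7, 0.0],
]
scores = [0.5, 0.05, 0.4, 0.05]  # assumed per-token importance

compressed = compress_2d(keys, scores, keep_tokens=2, keep_channels=2)
stored = sum(len(v) for v in compressed.values())
total = sum(len(k) for k in keys)
print(compressed)
print(f"stored {stored} of {total} entries")
```

In this toy, tokens 0 and 2 survive, and each keeps a different pair of channels. That per-vector choice is what separates a dynamic pattern from a uniform one applied to every vector. A real system would also need a storage layout and attention kernels that can consume such irregular data efficiently, which is the role the summary attributes to MosaicKV's cache management mechanism.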

Why it matters

For professionals deploying and scaling long-context LLMs, MosaicKV offers a critical solution to memory constraints and performance bottlenecks, enabling more efficient and cost-effective serving of advanced AI applications.

How to implement this in your domain

1. Evaluate MosaicKV's potential for reducing GPU memory footprint and increasing throughput in existing long-context LLM serving infrastructure.
2. Investigate integrating dynamic two-dimensional KV cache compression into custom LLM inference engines.
3. Benchmark MosaicKV's performance and accuracy trade-offs against current KV cache management strategies for specific LLM workloads.
4. Consider optimizing hardware resource utilization by leveraging underutilized GPU/CPU resources for compressed KV cache management as proposed by MosaicKV.
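As a starting point for the first step, the KV cache footprint of an existing deployment can be estimated with the standard sizing formula: two tensors (keys and values) per layer, each holding `num_kv_heads * head_dim` elements per token. The sketch below uses a hypothetical model configuration and a hypothetical 64 GiB memory budget. Applying the reported 3x memory reduction uniformly to the whole cache is also an assumption, not something the summary states.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Uncompressed KV cache size in bytes. The leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def max_batch_size(budget_bytes, per_request_bytes):
    """How many requests of this length fit in the given KV memory budget."""
    return budget_bytes // per_request_bytes


GIB = 2 ** 30

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes per element).
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=262_144)  # one 256K-token request

# Assumption: the reported 3x memory reduction applies uniformly to the cache.
compressed = baseline // 3

budget = 64 * GIB  # hypothetical memory reserved for the KV cache

print(f"baseline:   {baseline / GIB:.1f} GiB per request, "
      f"batch size {max_batch_size(budget, baseline)}")
print(f"compressed: {compressed / GIB:.1f} GiB per request, "
      f"batch size {max_batch_size(budget, compressed)}")
```

Under these assumptions a single 256K-token request needs 32 GiB uncompressed, so the budget holds two concurrent requests. At one third of that size it holds six. Actual gains depend on the model, the workload, and how compression interacts with the serving engine, which is why the third step recommends benchmarking on specific workloads.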

Who benefits

Cloud Computing, AI/ML Infrastructure, Software Development, Data Centers, Telecommunications

Key takeaways

MosaicKV significantly reduces KV cache memory usage for long-context LLMs.
It achieves substantial speedups in attention computation and higher serving throughput.
Dynamic two-dimensional compression minimizes accuracy loss by targeting important KV cache elements.
The system optimizes resource utilization by managing compressed caches efficiently.

Original post by Sheng Qiang, Ruiwei Chen, Yinpeng Wu, Jinyu Gu, Zhichao Hua, Yubin Xia, Binyu Zang, Haibo Chen

"arXiv:2607.00760v1 Announce Type: new Abstract: Long-context LLM services now sustain prompts with hundreds of thousands to millions of tokens, making the key-value (KV) cache a first-order serving cost. Because the cache grows linearly with context length, it can exhaust GPU mem…"

