Nexus Sampling Improves LLM KV-Cache Eviction for Long Contexts

Duc Duong, Hoang Anh Duy Le, Jianwen Xie, Anshumali Shrivastava, Zhaozhuo Xu · June 24, 2026

Summary

Nexus Sampling is a new training-free method for KV-cache eviction in LLMs that combines an iterative scoring mechanism with weighted reservoir sampling. It outperforms deterministic top-K methods by retaining subtly important tokens, which matters for long-context and agentic workloads.

Large Language Models (LLMs) handling long contexts or agentic tasks frequently exceed the fixed memory budget of their Key-Value (KV) cache, so tokens must be evicted continuously during inference. Existing eviction strategies typically compute a direct attention score and then apply deterministic top-K selection, which can permanently discard tokens whose importance is subtle. Nexus Sampling addresses this limitation with two components. Nexus scoring is an iterative process that identifies "bridge tokens" by walking over direct attention scores. Weighted reservoir sampling then assigns each token an inclusion probability instead of making irreversible deterministic decisions. Theoretically, Nexus Sampling is shown to improve the long-term survival of crucial tokens. Empirically, it matches dense attention performance on LongBench tasks and significantly outperforms top-K baselines on retrieval-heavy tasks, with up to a 10x reduction in per-sequence cache memory.
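
The contrast between a deterministic top-K cut and weighted reservoir sampling can be illustrated with a small sketch. It is not the paper's implementation: it uses the classic Efraimidis-Spirakis keying scheme for weighted sampling without replacement as a stand-in, and the function names (`top_k_keep`, `reservoir_keep`) and the toy scores are assumptions made for illustration.

```python
import heapq
import math
import random


def top_k_keep(scores, budget):
    """Deterministic baseline: keep the `budget` highest-scoring token indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def reservoir_keep(scores, budget, rng):
    """Weighted sampling without replacement (Efraimidis-Spirakis keys).

    Each token draws key = log(u) / score with u uniform in (0, 1];
    the `budget` tokens with the largest keys survive.
    Scores must be strictly positive.
    """
    keyed = []
    for i, s in enumerate(scores):
        u = 1.0 - rng.random()  # shifts [0, 1) to (0, 1], avoiding log(0)
        keyed.append((math.log(u) / s, i))
    return sorted(i for _, i in heapq.nlargest(budget, keyed))


# Toy eviction scores for five cached tokens (hypothetical values).
scores = [0.50, 0.30, 0.10, 0.06, 0.04]
rng = random.Random(0)

trials = 10_000
counts = [0] * len(scores)
for _ in range(trials):
    for i in reservoir_keep(scores, 2, rng):
        counts[i] += 1

print("top-K keeps:", top_k_keep(scores, 2))
print("reservoir inclusion rates:", [c / trials for c in counts])
```

With a budget of 2, top-K always keeps tokens 0 and 1, whereas under sampling the lower-scored tokens retain a nonzero chance of surviving each eviction round.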

Why it matters

For engineers building LLM applications that need long context windows or agentic loops, the KV cache can outgrow any fixed memory budget. Nexus Sampling is training-free, so it applies at inference time, and it raises the chance that important tokens survive eviction under that budget, which helps keep outputs accurate as the context grows.

How to implement this in your domain

  1. Integrate Nexus Sampling into LLM inference stacks to manage KV-cache eviction for long-context applications.
  2. Benchmark Nexus Sampling against existing top-K eviction methods for specific retrieval-heavy LLM tasks.
  3. Optimize LLM deployments by leveraging Nexus Sampling to reduce cache memory requirements without sacrificing performance.
  4. Explore adapting Nexus Sampling for other memory management challenges in large-scale AI models.
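
To gauge what a smaller token budget means for the cache memory mentioned in the steps above, the standard KV-cache size estimate (keys plus values, per layer, per KV head) can be computed directly. The model shape below is hypothetical and not taken from the paper, and the 10x smaller budget simply mirrors the "up to 10x" figure in the summary.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Estimate per-sequence KV-cache size in bytes.

    The factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical model shape (an assumption, not from the paper); fp16 cache.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)

full_context = 131_072        # tokens held when nothing is evicted
budget = full_context // 10   # tokens kept under a 10x smaller budget

full = kv_cache_bytes(full_context, **SHAPE)
kept = kv_cache_bytes(budget, **SHAPE)

print(f"full cache:    {full / 2**30:.2f} GiB")
print(f"evicted cache: {kept / 2**30:.2f} GiB")
```

Because the estimate is linear in the number of cached tokens, any reduction in the token budget translates directly into the same reduction in per-sequence cache memory.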

Who benefits

AI Development, Software Engineering, Customer Service, Content Creation, Research

Key takeaways

  • Nexus Sampling is a new training-free method for LLM KV-cache eviction.
  • It uses iterative Nexus scoring and weighted reservoir sampling.
  • The method retains subtly important tokens better than deterministic top-K.
  • It significantly outperforms top-K baselines on retrieval-heavy tasks and reduces per-sequence cache memory by up to 10x.
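
The summary does not spell out how Nexus scoring walks over attention scores, so the following is only one possible reading and not the authors' formula. It assumes a hypothetical propagation rule in which a token inherits part of the score of the tokens that attend to it, so a "bridge token" that the current query ignores can still rank highly. The function name `propagate_scores`, the `hops` and `decay` parameters, and the toy attention matrix are all illustrative assumptions.

```python
def propagate_scores(direct, attn, hops=2, decay=0.5):
    """Illustrative multi-hop scoring (hypothetical, not the paper's rule).

    direct[j]  : attention the current query pays to cached token j
    attn[i][j] : attention cached token i pays to cached token j
    Each hop adds a decayed share of every token's score to the
    tokens it attends to.
    """
    n = len(direct)
    scores = list(direct)
    for _ in range(hops):
        scores = [
            direct[j] + decay * sum(scores[i] * attn[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Toy example: the query nearly ignores token 0 but focuses on token 2,
# and token 2 itself attends mostly to token 0.
direct = [0.0, 0.05, 0.9, 0.05]
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]

scores = propagate_scores(direct, attn)
print("direct scores:    ", direct)
print("propagated scores:", [round(s, 3) for s in scores])
```

In this toy setup token 0 has the lowest direct score, so a top-K cut on direct attention would evict it first, yet after propagation it outranks token 1 because the heavily attended token 2 depends on it.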

Original post by Duc Duong, Hoang Anh Duy Le, Jianwen Xie, Anshumali Shrivastava, Zhaozhuo Xu

"arXiv:2606.23961v1 Announce Type: new Abstract: Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same templ…"

