New Framework Enhances LLM Reasoning by Optimizing Token Distributions

Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo · June 19, 2026

Summary

This paper introduces the Independent Combinatorial Tokens (ICT) framework, which improves Large Language Model (LLM) reasoning by focusing on token-level distributional deviations rather than scalar uncertainty. ICT uses Jensen-Shannon divergence to identify critical tokens, preventing entropy collapse or explosion and stabilizing training.

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning, but it faces a core instability. Uniform token updates can lead to "entropy collapse," in which the policy converges prematurely to suboptimal strategies, while excessive entropy maximization can result in "entropy explosion," which produces incoherent reasoning. The Independent Combinatorial Tokens (ICT) framework addresses this dilemma by shifting the optimization focus from simple scalar uncertainty measures to the distributional properties of token logits. Using Jensen-Shannon (JS) divergence, ICT identifies tokens with distinct distributional patterns and treats them as critical branching points that guide more effective exploration during reasoning.

According to the paper's theoretical analysis, selectively updating these tokens regulates policy concentration: it reduces overall distribution uncertainty while keeping probability concentration under control. This dual effect prevents the over-concentrated token generation that stifles exploration, and it stabilizes the training landscape. Empirical results on Qwen2.5 models show an average pass@4 gain of 4.58% and a maximum gain of 14.9% over existing baselines on reasoning benchmarks.
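To make the idea of scoring tokens by distributional deviation concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the summary does not say which distributions ICT compares, so the reference distribution (the mean next-token distribution over the sequence), the function names, and the `top_fraction` parameter are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last (vocabulary) axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    # Jensen-Shannon divergence in nats: 0.5*KL(p||m) + 0.5*KL(q||m),
    # with m = (p + q) / 2. It is symmetric and bounded above by ln 2.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m), axis=-1)
    kl_qm = np.sum(q * np.log(q / m), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def select_critical_tokens(logits, top_fraction=0.2):
    # logits: array of shape (T, V), one row of vocabulary logits per position.
    # ASSUMPTION (hypothetical, not from the paper): each position is compared
    # against the mean distribution over all positions in the sequence.
    probs = softmax(logits)                    # (T, V)
    reference = probs.mean(axis=0)             # (V,)
    scores = js_divergence(probs, reference)   # (T,)
    k = max(1, int(round(top_fraction * len(scores))))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True       # keep the k most deviant positions
    return scores, mask

# Usage with random logits standing in for real model outputs.
rng = np.random.default_rng(0)
example_logits = rng.normal(size=(16, 50))
example_scores, example_mask = select_critical_tokens(example_logits)
```

A scalar measure such as entropy collapses each distribution to a single number, whereas a divergence compares whole distributions, so two positions with equal entropy can still receive different scores here.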

Why it matters

ICT offers a more stable and effective way to train LLMs with reinforcement learning, leading to improved reasoning capabilities and more robust performance on complex problem-solving tasks.

How to implement this in your domain

  1. Investigate the ICT framework for fine-tuning or training custom LLMs to improve reasoning.
  2. Apply Jensen-Shannon divergence or similar distributional metrics to analyze token logits in LLM outputs.
  3. Implement selective token updating strategies to prevent entropy collapse or explosion during LLM training.
  4. Benchmark LLM reasoning performance using diverse problem sets to evaluate the impact of advanced optimization techniques.
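Steps 3 and 4 above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's training objective: `masked_policy_loss` is a hypothetical REINFORCE-style surrogate that only lets selected tokens contribute, and `pass_at_k` is the commonly used unbiased pass@k estimator, which may or may not match the evaluation protocol behind the reported pass@4 numbers.

```python
import math
import numpy as np

def masked_policy_loss(token_logprobs, advantages, update_mask):
    # Hypothetical selective-update surrogate (an assumption, not ICT's loss):
    # only positions where update_mask is True contribute, so all other
    # tokens would receive no gradient from this term.
    # token_logprobs: (T,) log-probabilities of the sampled tokens.
    # advantages:     (T,) per-token advantages, or a scalar that broadcasts.
    # update_mask:    (T,) boolean mask of tokens selected for updating.
    mask = update_mask.astype(float)
    per_token = -token_logprobs * advantages * mask
    # Average over selected tokens only; guard against an empty mask.
    return per_token.sum() / max(mask.sum(), 1.0)

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate from n samples per problem, c of them correct:
    # 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Usage: three sampled tokens, the middle one excluded from the update.
loss = masked_policy_loss(
    np.array([-1.0, -2.0, -3.0]),
    np.array([1.0, 1.0, 1.0]),
    np.array([True, False, True]),
)
```

In a real training loop the same mask would be applied inside an autodiff framework so that gradients flow only through the selected tokens; NumPy is used here just to show the arithmetic.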

Who benefits

AI Research, Software Development, Data Science, Education, Finance

Key takeaways

  • The ICT framework improves LLM reasoning by focusing on token-level distributional deviations.
  • It uses Jensen-Shannon divergence to identify critical tokens for effective exploration.
  • The method prevents both entropy collapse and entropy explosion during training.
  • Empirical results show significant improvements in LLM reasoning benchmarks.

Original post by Xuanzhi Feng, Zhengyang Li, Zeyu Liu, Haoxi Li, Yuming Jiang, Bing Guo, Jingcai Guo, Jie Zhang, Song Guo

"arXiv:2606.19771v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, lea…"


