New Framework Quantifies LLM Logical Reasoning Consistency

Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo · June 17, 2026

Summary

A new framework, structural uncertainty, quantifies consistency in LLM logical reasoning by assessing the stability of self-preference-induced rankings over sampled reasoning solutions. It decomposes consistency into across-trial ranking instability and within-trial candidate ambiguity, providing complementary insights to output dispersion.

Large Language Models (LLMs) often produce correct answers through unstable or contradictory reasoning paths, particularly in multi-step deductive tasks. Traditional methods for assessing reliability focus primarily on output dispersion, which measures how much sampled answers vary. Researchers have introduced "structural uncertainty," a framework that evaluates consistency by analyzing how stable an LLM's self-preferences are among its own generated reasoning candidates. The procedure generates multiple solutions for a query, then asks the model to compare its own outputs in pairs. The framework aggregates these self-preferences into ranking distributions and decomposes the signal into two entropy-based components: across-trial ranking instability and within-trial candidate ambiguity. Experiments across various LLMs and benchmarks show that structural signals offer complementary information to answer dispersion, especially for logical and mathematical reasoning tasks, improving the identification of unreliable instances.
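The two entropy-based components can be illustrated with a minimal Python sketch. The summary does not give the paper's exact formulas, so everything here is an assumption made for illustration: rankings are built from pairwise win counts with ties broken by index, across-trial instability is taken as the entropy of the empirical distribution over rankings, and within-trial ambiguity as the entropy of one trial's normalized win counts. The function names and the hard-coded preference data are hypothetical; in practice the pairwise winners would come from asking the model to compare its own candidates.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def rank_trial(n_candidates, pair_winners):
    """Turn one trial's pairwise self-preferences into a ranking.

    pair_winners maps a candidate pair (i, j) to the index the model preferred.
    Candidates are ordered by win count, ties broken by index (an assumption).
    """
    wins = [0] * n_candidates
    for winner in pair_winners.values():
        wins[winner] += 1
    ranking = tuple(sorted(range(n_candidates), key=lambda c: (-wins[c], c)))
    return ranking, wins


def across_trial_instability(rankings):
    """Entropy of the empirical distribution over rankings across trials."""
    counts = Counter(rankings)
    return entropy(c / len(rankings) for c in counts.values())


def within_trial_ambiguity(wins):
    """Entropy of one trial's normalized win counts (flat scores = ambiguous)."""
    total = sum(wins)
    return entropy(w / total for w in wins) if total else 0.0


# Hypothetical self-preference judgments for 3 candidates over 3 trials.
stable_trials = [{(0, 1): 0, (0, 2): 0, (1, 2): 1}] * 3
unstable_trials = [
    {(0, 1): 0, (0, 2): 0, (1, 2): 1},  # ranking (0, 1, 2)
    {(0, 1): 1, (0, 2): 2, (1, 2): 2},  # ranking (2, 1, 0)
    {(0, 1): 1, (0, 2): 0, (1, 2): 1},  # ranking (1, 0, 2)
]

stable_rankings = [rank_trial(3, t)[0] for t in stable_trials]
unstable_rankings = [rank_trial(3, t)[0] for t in unstable_trials]

print(across_trial_instability(stable_rankings))
print(across_trial_instability(unstable_rankings))
print(within_trial_ambiguity(rank_trial(3, unstable_trials[0])[1]))
```

In this toy data the stable trials always yield the same ranking, so the instability is zero, while the unstable trials yield three distinct rankings and the instability reaches ln 3, the maximum for three equally likely outcomes.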

Why it matters

This research offers a way to evaluate the reliability of LLM reasoning that looks beyond whether sampled answers agree and checks whether the model ranks its own reasoning consistently. That matters when deploying AI systems in critical applications, where trust in the reasoning process, not just the final answer, is paramount.

How to implement this in your domain

  1. Integrate structural uncertainty metrics into LLM evaluation pipelines for critical applications.
  2. Use the framework to diagnose reasoning consistency issues in multi-step LLM tasks.
  3. Develop LLM fine-tuning strategies that prioritize consistent reasoning paths over mere output accuracy.
  4. Apply structural uncertainty to compare and select LLMs for tasks requiring high logical fidelity.
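As a rough sketch of the first step, an evaluation pipeline could report answer dispersion and a structural signal side by side and flag an instance when either is high. This is not the paper's procedure: the function names, the use of plain empirical entropy for both signals, and both thresholds are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def empirical_entropy(items):
    """Shannon entropy (nats) of the empirical distribution over items."""
    counts = Counter(items)
    total = len(items)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def flag_unreliable(sampled_answers, trial_rankings,
                    dispersion_threshold=0.5, instability_threshold=0.5):
    """Flag an instance if answers disperse OR self-rankings are unstable.

    Both thresholds are placeholder values, not taken from the paper.
    """
    dispersion = empirical_entropy(sampled_answers)
    instability = empirical_entropy(trial_rankings)
    return {
        "answer_dispersion": dispersion,
        "ranking_instability": instability,
        "unreliable": (dispersion > dispersion_threshold
                       or instability > instability_threshold),
    }


# Hypothetical instance: every sampled answer agrees, but the model's
# self-preference rankings of its reasoning candidates change across trials.
report = flag_unreliable(
    sampled_answers=["42", "42", "42", "42"],
    trial_rankings=[(0, 1, 2), (2, 1, 0), (1, 0, 2), (0, 1, 2)],
)
print(report)
```

The example instance has zero answer dispersion, so a dispersion-only check would pass it. It is flagged only because the rankings disagree across trials, which is the kind of complementary information the structural signal is meant to provide.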

Who benefits

AI/ML Research, Software Development, Quality Assurance, Finance, Healthcare

Key takeaways

  • LLMs can achieve correct answers via inconsistent reasoning paths.
  • Structural uncertainty quantifies reasoning consistency via self-preference rankings.
  • It offers complementary insights to traditional output dispersion metrics.
  • Across-trial instability signals unreliable reasoning, while within-trial ambiguity can correlate with correctness.

Original post by Baishali Chaudhury, Mengdie Flora Wang, Hyunji Hayley Park, Rahul Ghosh, Sungmin Hong, Jae Oh Woo

arXiv:2606.17312v1. Abstract: "Large language models can arrive at the same answer through reasoning paths that are unstable, contradictory, or difficult to rank consistently -- a failure mode especially prevalent in multi-step deductive reasoning. Existing metho…"