Excessive LLM Sampling Can Worsen Answer Quality and Waste Compute

Yong Yi Bay, Kathleen A. Yearick · June 30, 2026

Summary

The paper shows that increased sampling (test-time scaling) can improve the coverage of correct answers by LLMs, but it often yields diminishing returns and can even degrade the final selected answer. Beyond a certain point, extra samples only make the model more confident in a wrong answer. The paper describes these limits on effective sampling as the "modal ceiling" and the "correlation ceiling."

Systems built on Large Language Models (LLMs) often use test-time scaling: they sample the same question many times to raise the chance that at least one sample is correct. This can improve *coverage* (the likelihood that at least one correct answer appears), but the research shows that it does not necessarily improve *selection* (which answer is finally returned). Excessive sampling can even be detrimental.

The study introduces two limits. The "modal ceiling" applies to selecting a single answer: the consensus vote often settles within a few dozen samples, and samples drawn after that point mainly increase confidence in the answer already chosen, which may be incorrect, rather than improving accuracy. The "correlation ceiling" applies to benchmark scoring, where the effective number of samples is even lower.

The resulting "identifiability gap" is that LLMs can generate correct answers they cannot reliably pick. The bottleneck is recognition, not generation. Over-sampling wastes compute and can lead to worse outcomes by reinforcing confident mistakes, which suggests that the optimal sample count is far smaller than commonly assumed.
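The difference between coverage and selection can be illustrated with a toy calculation. The sketch below is not the paper's model. It assumes, purely for illustration, that each sample is drawn independently and is either the correct answer (probability `p`) or one shared wrong answer (probability `1 - p`), and that the final answer is chosen by strict majority vote. The function names `coverage` and `majority_vote_accuracy` are hypothetical.

```python
from math import comb


def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that the correct answer wins a strict majority of n samples.

    Toy assumption: every sample is either the correct answer (prob. p) or
    one single shared wrong answer (prob. 1 - p), drawn independently.
    n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so the vote cannot tie")
    # Sum the binomial probabilities of k correct samples for every k > n/2.
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    p = 0.4  # the correct answer is generated often, but is not the mode
    for n in (1, 5, 21, 51):
        print(n, round(coverage(p, n), 4), round(majority_vote_accuracy(p, n), 4))
```

Under these assumptions, when the correct answer is not the most likely one (`p` below 0.5), coverage climbs toward 1 as `n` grows while the accuracy of the voted answer falls: more samples make the vote more certain about the wrong answer.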

Why it matters

For professionals deploying LLMs, knowing where these ceilings fall is crucial for controlling compute costs and improving the reliability of single-answer outputs. Sampling only up to that point avoids over-engineering and directs resources where they have an effect.

How to implement this in your domain

  1. Analyze your LLM's test-time scaling strategies to identify the "modal ceiling" for your specific tasks.
  2. Implement dynamic sampling cutoffs that stop generating samples once a clear consensus or sufficient confidence is reached.
  3. Prioritize improving the selection mechanism (how the best answer is chosen) rather than simply increasing the number of generated samples.
  4. Monitor the effective number of samples needed for consistent performance to optimize resource allocation.
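One way to realize the dynamic cutoff in step 2 is to stop as soon as the leading answer can no longer be overtaken within the remaining budget. The sketch below is one possible stopping rule, not the paper's method. The function `vote_with_early_stop` and its `draw` callback (standing in for one LLM call that returns a normalized answer) are hypothetical names, and the default budget of 32 is an arbitrary illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def vote_with_early_stop(
    draw: Callable[[], Hashable], budget: int = 32
) -> Tuple[Hashable, int]:
    """Majority vote that stops once the outcome is already decided.

    draw:   hypothetical callback returning one sampled answer per call.
    budget: maximum number of samples we are willing to pay for.
    Returns (winning answer, number of samples actually drawn).
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    counts: Counter = Counter()
    for used in range(1, budget + 1):
        counts[draw()] += 1
        top = counts.most_common(2)
        runner_up = top[1][1] if len(top) > 1 else 0
        lead = top[0][1] - runner_up
        # If the lead exceeds the samples still available, no other answer
        # can catch up, so the full-budget vote would pick the same winner.
        if lead > budget - used:
            return top[0][0], used
    return counts.most_common(1)[0][0], budget


if __name__ == "__main__":
    answers = iter(["A", "A", "B", "A", "B"])
    print(vote_with_early_stop(lambda: next(answers), budget=5))
```

This rule returns the same winner a full-budget vote would, using fewer samples. It saves compute but does not make the vote more accurate: if the consensus answer is wrong, it is still wrong, which is why step 3 targets the selection mechanism itself.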

Who benefits

AI/ML Engineering, Software Development, Cloud Computing, Data Science

Key takeaways

  • More LLM sampling does not always lead to better final answers; it can increase confidence in wrong ones.
  • The "modal ceiling" means the consensus vote often settles within a few dozen samples, so further samples rarely change the selected answer.
  • The "identifiability gap" means LLMs can generate correct answers but struggle to pick them.
  • Focus on improving answer selection mechanisms rather than just increasing sample generation to save compute and improve accuracy.
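The idea of an "effective number of samples" can be made concrete with a standard statistical formula. The sketch below is an assumption for illustration and may differ from the paper's definition: it treats the `n` samples as equally correlated with pairwise correlation `rho`, in which case averaging them carries as much information as `n / (1 + (n - 1) * rho)` independent samples. The function name `effective_samples` is hypothetical.

```python
def effective_samples(n: int, rho: float) -> float:
    """Effective sample size of n equally correlated samples.

    Assumption (not taken from the paper): every pair of samples has the
    same correlation rho in [0, 1]. The variance of their mean then matches
    that of n / (1 + (n - 1) * rho) independent samples.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(effective_samples(n, rho=0.1), 2))
```

Under this assumption the effective count can never exceed `1 / rho`, however many samples are drawn. With `rho = 0.1`, for example, it stays below 10, which is one way to picture a ceiling on what additional correlated samples can contribute.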

Original post by Yong Yi Bay, Kathleen A. Yearick

"arXiv:2606.28661v1 Announce Type: new Abstract: People overthink; language models over-sample, and the extra effort can talk both into a worse answer. Reasoning systems answer a hard question by sampling it many times (test-time scaling), and the more they draw, the more often a…"
