Counterexamples and Fix for Monte Carlo Exploring Starts

Octave Oliviers, Glenn Vinnicombe · June 16, 2026

Summary

This paper presents counterexamples demonstrating that Monte Carlo Exploring Starts (MCES) can converge to suboptimal solutions in reinforcement learning, even in tabular settings. It also proposes a modification for initial-visit MCES that restores convergence to optimality by scaling learning rates inversely to update frequencies.

The asymptotic behavior of Monte Carlo Exploring Starts (MCES) in reinforcement learning has long been an unresolved question, even for tabular settings. This research investigates the convergence properties of tabular MCES and constructs specific examples where the algorithm fails to converge to optimal solutions, instead settling on suboptimal ones. The paper provides new counterexamples for both initial-visit and first-visit MCES. It shows that stable suboptimal solutions can exist for initial-visit MCES, even when greedy actions are updated more frequently than non-greedy ones. Crucially, the study also offers a modification that restores convergence to optimality for the initial-visit case. This fix involves scaling learning rates inversely to update frequencies on a state-by-state basis. Unlike previous uniformization methods, this modification is applicable to large-scale problems where value functions are approximated. These findings largely resolve a fundamental open problem, emphasizing that exploring starts alone do not guarantee optimality and that the choice of learning rates and the balance between exploration and exploitation are critical.
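For orientation, the baseline algorithm discussed above can be sketched as follows. This is a minimal illustration of tabular initial-visit MCES, not the paper's code: the toy MDP, the function names, the uniform exploring-start distribution, and the sample-average step size are all assumptions made for the example. The toy problem is deliberately benign and is not one of the paper's counterexamples.

```python
import random
from collections import defaultdict

# Hypothetical toy episodic MDP (an assumption, not taken from the paper):
# states 0 and 1, actions 0 and 1, state 2 is terminal.
# TRANSITIONS[(state, action)] = (next_state, reward)
TRANSITIONS = {
    (0, 0): (1, 0.0),
    (0, 1): (2, 0.5),
    (1, 0): (2, 1.0),
    (1, 1): (2, 0.0),
}
STATES, ACTIONS, TERMINAL = (0, 1), (0, 1), 2


def greedy(q, state):
    """Greedy action under q, ties broken towards the lowest action index."""
    return max(ACTIONS, key=lambda a: (q[(state, a)], -a))


def rollout_return(q, state, action, max_steps=100):
    """Undiscounted return from (state, action), acting greedily afterwards."""
    total = 0.0
    for _ in range(max_steps):
        state, reward = TRANSITIONS[(state, action)]
        total += reward
        if state == TERMINAL:
            break
        action = greedy(q, state)
    return total


def initial_visit_mces(episodes=2000, seed=0):
    """Tabular initial-visit MCES with sample-average step sizes."""
    rng = random.Random(seed)
    q = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(episodes):
        # Exploring start: uniform random initial state-action pair.
        s0, a0 = rng.choice(STATES), rng.choice(ACTIONS)
        g = rollout_return(q, s0, a0)
        # Initial-visit: only the starting pair is updated with the return.
        counts[(s0, a0)] += 1
        q[(s0, a0)] += (g - q[(s0, a0)]) / counts[(s0, a0)]
    return q
```

Calling `initial_visit_mces()` returns the learned action values; on this toy problem the greedy policy they induce is the optimal one. The point of the sketch is only to show which quantity is updated and how often. Under initial-visit updates, a pair is updated exactly as often as episodes start there.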

Why it matters

This research clarifies a fundamental theoretical limitation of a widely used reinforcement learning algorithm: exploring starts alone do not guarantee optimal policies. It also provides a practical fix for the initial-visit variant, showing practitioners how to adjust learning rates so that convergence to optimality is restored.

How to implement this in your domain

  1. Review the counterexamples to understand the conditions under which MCES can fail to converge optimally.
  2. Implement the proposed learning rate scaling modification for initial-visit MCES in your reinforcement learning projects.
  3. Evaluate the impact of this modification on the convergence and optimality of your agents in various environments.
  4. Consider how these insights into learning rates and update frequencies apply to other Monte Carlo control methods.
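One way to approach step 2 is sketched below. The summary does not give the paper's exact scaling rule, so this is one plausible reading under stated assumptions: the exploring-start distribution is known, each pair's update frequency equals its start probability (which holds for initial-visit updates), and step sizes are normalised so that a state with equally frequent actions keeps the base rate. The function name `frequency_scaled_step_sizes` and the example distribution are hypothetical.

```python
from collections import defaultdict


def frequency_scaled_step_sizes(start_probs, base_rate):
    """Return a step size per (state, action) pair.

    start_probs: dict mapping (state, action) -> probability that an episode
        starts there. Under initial-visit updates this is also the pair's
        update frequency.
    base_rate: step size a pair gets when its state's actions are updated
        equally often (an assumed normalisation, not from the paper).
    """
    state_mass = defaultdict(float)
    actions_per_state = defaultdict(int)
    for (state, _action), prob in start_probs.items():
        if prob <= 0.0:
            raise ValueError("exploring starts need positive probability for every pair")
        state_mass[state] += prob
        actions_per_state[state] += 1

    step_sizes = {}
    for (state, action), prob in start_probs.items():
        # Share of this state's updates that go to this action.
        share = prob / state_mass[state]
        # Inverse-frequency scaling, applied state by state.
        step_sizes[(state, action)] = base_rate / (actions_per_state[state] * share)
    return step_sizes


# Hypothetical start distribution: in state "s", action "a" is
# updated four times as often as action "b".
start_probs = {("s", "a"): 0.4, ("s", "b"): 0.1, ("t", "a"): 0.25, ("t", "b"): 0.25}
alphas = frequency_scaled_step_sizes(start_probs, base_rate=0.05)
```

With this scaling, the product of update share and step size is the same for every action within a state, so rarely updated actions are not left behind by frequently updated ones. In a convergence setting `base_rate` would typically follow a diminishing schedule and might need clipping; those details are not specified in this summary and should be taken from the paper.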

Who benefits

Robotics, Gaming, Autonomous Systems, Operations Research, Machine Learning Research

Key takeaways

  • Monte Carlo Exploring Starts (MCES) can converge to suboptimal solutions.
  • Counterexamples are provided for both initial-visit and first-visit MCES.
  • A learning rate scaling modification restores convergence to optimality for initial-visit MCES.
  • Convergence depends critically on learning rates and update frequencies.

Original post by Octave Oliviers, Glenn Vinnicombe

"arXiv:2606.15247v1 Announce Type: new Abstract: The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing exam…"

