New TD(0) Algorithm Achieves Robust and Fast Rates with Single Stepsize.

Wei-Cheng Lee, Francesco Orabona · June 25, 2026

Summary

This research analyzes plain, unprojected linear TD(0) with Polyak-Ruppert averaging under Markovian sampling, where data are generated along a single trajectory. Using a single stepsize schedule, it provides high-probability guarantees for both a robust curvature-free convergence rate and a fast curvature-dependent rate. The iterates stay uniformly bounded without projections, which simplifies the algorithm in practice.

The paper concerns linear TD(0), a fundamental method for value function estimation in reinforcement learning. Its key contribution is showing that a single, carefully chosen stepsize schedule, combined with Polyak-Ruppert averaging, is enough: neither a projection step nor prior knowledge of curvature parameters is needed, although both are often required to ensure stability and convergence in traditional treatments of TD(0). With this one schedule, the iterates remain bounded with high probability, and the method attains two guarantees at once: a robust convergence rate that is independent of the curvature, and a faster rate that exploits curvature information. The analysis rests on a Poisson-equation toolkit for Markov chains, which is used to decompose the noise and establish pathwise stability.
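For orientation, the textbook form of the objects named above can be written out as follows. This is a sketch in standard notation, not the paper's own: the symbols $\phi$ (feature map), $\gamma$ (discount factor), $\eta_t$ (stepsize), $P$ (transition kernel), $\pi$ (stationary distribution), and $\hat g$ (Poisson-equation solution) are assumptions introduced here, and the specific stepsize schedule the authors use is not reproduced in this summary.

```latex
% Plain (unprojected) linear TD(0) along a single trajectory s_1, s_2, \dots
\theta_{t+1} = \theta_t + \eta_t \bigl( r_t + \gamma\, \phi(s_{t+1})^\top \theta_t
               - \phi(s_t)^\top \theta_t \bigr)\, \phi(s_t)

% Polyak-Ruppert average of the iterates
\bar\theta_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t

% Poisson equation for a function g on the state space
\hat g(s) - (P \hat g)(s) = g(s) - \pi(g)

% Resulting split of Markovian noise into a martingale difference
% plus a telescoping term
g(s_t) - \pi(g)
  = \underbrace{\hat g(s_{t+1}) - (P \hat g)(s_t)}_{\text{martingale difference}}
  + \underbrace{\hat g(s_t) - \hat g(s_{t+1})}_{\text{telescoping}}
```

The last identity follows by adding and subtracting $\hat g(s_{t+1})$ in the Poisson equation evaluated at $s_t$. How exactly the paper applies this decomposition is not described in the summary.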

Why it matters

Professionals working with reinforcement learning algorithms can benefit from this simplified yet robust approach to TD(0), potentially leading to more stable and efficient training of agents without extensive hyperparameter tuning. It offers a theoretical foundation for improved practical implementations in areas like control systems and autonomous decision-making.

How to implement this in your domain

  1. Review the proposed stepsize schedule and Polyak-Ruppert averaging technique for TD(0) implementations.
  2. Experimentally apply this method in existing reinforcement learning environments where TD(0) is used for value estimation.
  3. Compare the stability and convergence speed against traditional TD(0) with projected updates or multiple stepsize tuning.
  4. Integrate the simplified TD(0) into custom agents for tasks requiring robust and efficient learning.
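As a starting point for the steps above, here is a minimal sketch of unprojected linear TD(0) with Polyak-Ruppert averaging on a single trajectory. Everything concrete in it is an assumption made for illustration: the 3-state Markov reward process, the feature vectors, and in particular the stepsize `c / sqrt(t + 1)`, which is a generic decaying schedule and not the schedule proposed in the paper. The function names `td0_pr` and `td_fixed_point` are hypothetical.

```python
import math
import random

# Hypothetical 3-state Markov reward process (illustrative, not from the paper).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]                      # transition probabilities
R = [0.0, 0.0, 1.0]                        # reward received in each state
PHI = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]] # 2-d feature vector per state
PI = [0.25, 0.5, 0.25]                     # stationary distribution of P
GAMMA = 0.9                                # discount factor


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_fixed_point():
    """Solve A theta = b for the TD fixed point (2x2 system, Cramer's rule)."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s in range(3):
        # Expected next-state feature vector given current state s.
        exp_next = [sum(P[s][n] * PHI[n][j] for n in range(3)) for j in range(2)]
        for i in range(2):
            b[i] += PI[s] * R[s] * PHI[s][i]
            for j in range(2):
                A[i][j] += PI[s] * PHI[s][i] * (PHI[s][j] - GAMMA * exp_next[j])
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def td0_pr(num_steps=200_000, c=0.5, seed=0):
    """Run plain TD(0) on one trajectory and return the averaged iterate."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]       # current iterate; never projected or clipped
    theta_bar = [0.0, 0.0]   # running Polyak-Ruppert average
    s = 0
    for t in range(num_steps):
        s_next = rng.choices(range(3), weights=P[s])[0]
        eta = c / math.sqrt(t + 1)  # assumed schedule, NOT the paper's
        delta = R[s] + GAMMA * dot(PHI[s_next], theta) - dot(PHI[s], theta)
        theta = [th + eta * delta * f for th, f in zip(theta, PHI[s])]
        # Incremental mean over the iterates produced so far.
        theta_bar = [tb + (th - tb) / (t + 1) for tb, th in zip(theta_bar, theta)]
        s = s_next
    return theta_bar
```

Calling `td0_pr()` and comparing the result against `td_fixed_point()` gives a simple convergence check; for this toy chain the fixed point is approximately (1.591, 3.409). To follow step 3, the same loop can be rerun with a projection added after the update or with a different stepsize, and the errors compared.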

Who benefits

AI/ML Development, Robotics, Autonomous Systems, Financial Modeling

Key takeaways

  • A single stepsize schedule can simplify TD(0) optimization while ensuring robust performance.
  • Polyak-Ruppert averaging helps achieve both curvature-free and curvature-dependent fast convergence rates.
  • The method guarantees bounded iterates without the need for explicit projections.
  • This research offers a more stable and efficient approach to value function estimation in reinforcement learning.

Original post by Wei-Cheng Lee, Francesco Orabona

"arXiv:2606.24981v1 Announce Type: new Abstract: We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain unprojected TD(0) algorithm with Polyak-Ruppert (PR) averaging, using a single st…"
