Dynamic Support Learning Enhances Reinforcement Learning Value Estimation

Jen-Yen Chang, Takayuki Osa, Tatsuya Harada · July 3, 2026

Summary

This paper introduces an approach that dynamically learns the lower and upper bounds of the support interval for categorical critics in reinforcement learning. The objective forms a tighter upper bound on the mean-squared Bellman error. In experiments, the support interval adapts more stably, and performance on continuous-control tasks matches or improves on HL-Gauss, without requiring a pre-defined support interval.

Traditional reinforcement learning (RL) often estimates value functions using regression, but distributional RL models a distribution of returns. A recent technique, Gaussian Histogram Loss (HL-Gauss), reframes value estimation as a classification problem by encoding scalar Bellman targets into Gaussian-smoothed categorical targets. However, a key challenge with histogram-based losses in RL is the need to pre-define a fixed support interval, which is difficult given the non-stationary and stochastic nature of RL target values. This research proposes a novel method that dynamically learns the lower and upper bounds of this support interval, rather than setting them beforehand. The objective function jointly optimizes these bounds with the categorical representation, forming a tighter upper bound on the mean-squared Bellman error. Empirical results show that this approach leads to more stable adaptation of the support interval and matches or improves performance on various continuous-control tasks compared to HL-Gauss, all without requiring prior specification of the support range.
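To make the encoding step concrete, here is a minimal sketch of how a scalar target can be turned into a Gaussian-smoothed categorical target over a fixed support. It is an illustration under assumptions, not the paper's code: the function names, the bin count of 51, the `sigma_ratio` of 0.75, and the clipping of out-of-range targets are all choices made for this example.

```python
import math


def hl_gauss_probs(target, v_min, v_max, num_bins=51, sigma_ratio=0.75):
    """Encode a scalar target as a Gaussian-smoothed histogram on [v_min, v_max].

    All parameter names and defaults here are illustrative assumptions.
    """
    bin_width = (v_max - v_min) / num_bins
    sigma = sigma_ratio * bin_width
    # Assumption: targets outside the support are clipped to its edges.
    target = min(max(target, v_min), v_max)
    edges = [v_min + i * bin_width for i in range(num_bins + 1)]
    # Gaussian CDF evaluated at every bin edge.
    cdf = [0.5 * (1.0 + math.erf((e - target) / (sigma * math.sqrt(2.0))))
           for e in edges]
    # Renormalise by the mass that falls inside the support.
    total = cdf[-1] - cdf[0]
    return [(cdf[i + 1] - cdf[i]) / total for i in range(num_bins)]


def expected_value(probs, v_min, v_max):
    """Decode a histogram back to a scalar using the bin centres."""
    num_bins = len(probs)
    bin_width = (v_max - v_min) / num_bins
    centers = [v_min + (i + 0.5) * bin_width for i in range(num_bins)]
    return sum(p * c for p, c in zip(probs, centers))


# A target inside the support decodes back to roughly itself.
inside = expected_value(hl_gauss_probs(3.0, -10.0, 10.0), -10.0, 10.0)
# A target outside the support cannot be represented: it decodes below v_max.
outside = expected_value(hl_gauss_probs(25.0, -10.0, 10.0), -10.0, 10.0)
```

In this sketch, a target of 25 on a support of [-10, 10] decodes to a value no larger than 10, which is the failure mode that motivates learning the bounds instead of fixing them in advance.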

Why it matters

This advancement offers a more robust and adaptable way to estimate value functions in reinforcement learning, potentially leading to more stable and efficient training of AI agents in complex environments.

How to implement this in your domain

  1. Explore integrating dynamic support learning into existing reinforcement learning frameworks for improved value estimation.
  2. Test the proposed method on specific continuous-control tasks to assess its performance benefits compared to fixed-support approaches.
  3. Analyze the stability and convergence properties of RL agents trained with this dynamic support learning technique.
  4. Consider adapting this approach for applications requiring robust and adaptive value function approximation, such as robotics or autonomous systems.
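As a starting point for the first two steps, the toy sketch below shows why support bounds can be trained by gradient descent at all: the value decoded from a histogram is linear in the lower and upper bound, so a squared error on that value has a simple analytic gradient with respect to both. This is not the paper's objective, which jointly optimizes the bounds with the categorical representation; the function names, learning rate, and fixed histogram are hypothetical choices for illustration only.

```python
def decode(probs, v_min, v_max):
    """Return the decoded value and the normalised position `frac` in [0, 1]."""
    n = len(probs)
    frac = sum(p * (i + 0.5) / n for i, p in enumerate(probs))
    # The decoded value is linear in the bounds:
    #   value = (1 - frac) * v_min + frac * v_max
    return v_min + (v_max - v_min) * frac, frac


def bounds_gradient_step(probs, v_min, v_max, target, lr=0.1):
    """One gradient step on (value - target)^2 with respect to the bounds.

    Toy illustration only: the histogram is held fixed, unlike a real critic.
    """
    value, frac = decode(probs, v_min, v_max)
    err = value - target
    grad_min = 2.0 * err * (1.0 - frac)  # d(err^2) / d(v_min)
    grad_max = 2.0 * err * frac          # d(err^2) / d(v_max)
    return v_min - lr * grad_min, v_max - lr * grad_max


# Hypothetical setup: all mass in the top bin, target far above the support.
probs = [0.0] * 9 + [1.0]
v_min, v_max = -10.0, 10.0
for _ in range(200):
    v_min, v_max = bounds_gradient_step(probs, v_min, v_max, target=25.0)
final_value, _ = decode(probs, v_min, v_max)
```

Here the upper bound grows until the decoded value reaches the target, something a fixed support cannot do. A real comparison against a fixed-support baseline would also train the histogram and use bootstrapped targets, which this sketch leaves out.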

Who benefits

Robotics, Autonomous Vehicles, Gaming, Logistics, AI Development

Key takeaways

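  • Dynamically learning the support interval matches or improves categorical critic performance on continuous-control tasks compared to HL-Gauss.
  • The objective forms a tighter upper bound on the mean-squared Bellman error, and the support interval adapts more stably.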
  • It eliminates the need for pre-defining fixed support intervals, simplifying RL setup.
  • Improved value estimation can lead to more efficient and robust RL agents.

Original post by Jen-Yen Chang, Takayuki Osa, Tatsuya Harada

"arXiv:2607.01880v1 Announce Type: new Abstract: Value functions are an essential component in actor-critic based deep reinforcement learning (RL). Conventionally, these functions are trained as a regression task by minimising the mean squared error (MSE) relative to bootstrapped…"
