Reversal Q-Learning Improves Offline Reinforcement Learning Performance

Aditya Oberai, Seohong Park, Sergey Levine · June 17, 2026

Summary

This paper introduces Reversal Q-learning (RQL), a new off-policy reinforcement learning algorithm that trains a flow policy from prior data. RQL generates virtual on-policy trajectories by "reversing" flows and applies bias-and-variance reduction to mitigate the curse of horizon. Across 50 simulated robotic tasks, it outperforms state-of-the-art flow-based offline RL algorithms on average.

Iterative generative modeling techniques such as flow matching are powerful tools for modeling complex behaviors, which is crucial for effective offline reinforcement learning (RL). The paper proposes Reversal Q-learning (RQL), an off-policy RL algorithm that trains a flow policy from prior data. RQL operates in an "expanded" Markov decision process (MDP) in which each flow refinement step is treated as a distinct action. Two techniques make off-policy RL possible in this setting. First, RQL generates virtual on-policy trajectories by "reversing" flows, which makes the framework compatible with prior data. Second, it applies a bias-and-variance reduction method to address the curse of horizon inherent in off-policy RL. Compared with previous flow-based RL methods, RQL avoids backpropagation through time, makes more effective use of the learned value function, and directly trains a full, expressive flow policy. In experiments across 50 challenging simulated robotic tasks, RQL achieves better average offline RL performance than other state-of-the-art flow-based algorithms.
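To make the "expanded" MDP idea concrete, here is a minimal Python sketch of one plausible reading of it. The summary does not spell out how flows are "reversed", so everything below is an assumption for illustration: the straight-line path from Gaussian noise to the dataset action, the choice to pay the environment reward only on the final refinement step, and the names `ExpandedTransition` and `reverse_flow` are all hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class ExpandedTransition:
    """One step of a hypothetical expanded MDP (illustrative, not the paper's API)."""
    state: Vector         # environment state s, held fixed while the action is refined
    partial: Vector       # partially refined action x_k
    step: int             # refinement index k
    delta: Vector         # expanded-MDP "action": the refinement x_{k+1} - x_k
    reward: float         # environment reward, assumed to arrive only on the last step
    done_refining: bool   # True when x_{k+1} is the action sent to the environment


def reverse_flow(
    state: Vector,
    action: Vector,
    reward: float,
    num_steps: int,
    rng: random.Random,
) -> Tuple[List[ExpandedTransition], Vector]:
    """Build a virtual refinement trajectory that ends at a dataset action.

    Assumption: the path from noise to the action is a straight line, so every
    refinement step applies the same increment (action - noise) / num_steps.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    delta = [(a - z) / num_steps for a, z in zip(action, noise)]

    transitions: List[ExpandedTransition] = []
    partial = noise
    for k in range(num_steps):
        last = k == num_steps - 1
        transitions.append(
            ExpandedTransition(
                state=state,
                partial=list(partial),
                step=k,
                delta=list(delta),
                reward=reward if last else 0.0,
                done_refining=last,
            )
        )
        partial = [x + d for x, d in zip(partial, delta)]

    # `partial` now equals the dataset action up to floating-point error.
    return transitions, partial
```

Under these assumptions, each dataset tuple (state, action, reward) expands into `num_steps` transitions whose refinements lead exactly to the logged action, which is the sense in which the trajectory is "virtual" yet consistent with prior data. It also shows why the horizon grows: every environment step becomes `num_steps` steps in the expanded MDP.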

Why it matters

For AI engineers and researchers developing autonomous systems, robotics, or complex decision-making agents, RQL offers a way to learn more effectively from existing datasets. This reduces the need for costly and time-consuming online data collection while still training an expressive flow policy.

How to implement this in your domain

  1. Investigate integrating Reversal Q-learning into existing offline reinforcement learning pipelines for robotics and control.
  2. Apply RQL to leverage large datasets of historical interactions to train more effective policies without online exploration.
  3. Explore the use of flow-based generative models in conjunction with RQL for complex behavior modeling.
  4. Benchmark RQL against current state-of-the-art offline RL algorithms for specific application domains to assess performance gains.
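For the benchmarking step, a small helper like the following could aggregate results before comparing algorithms. It is only a sketch: the `Scores` layout (algorithm name to task name to per-seed returns) and the function name `average_performance` are hypothetical, and the two-level averaging (seeds within a task, then tasks) is one common convention rather than the protocol confirmed by the paper.

```python
from statistics import mean
from typing import Dict, List

# Hypothetical layout: scores[algorithm][task] -> list of per-seed returns.
Scores = Dict[str, Dict[str, List[float]]]


def average_performance(scores: Scores) -> Dict[str, float]:
    """Average per-seed returns within each task, then average across tasks."""
    summary: Dict[str, float] = {}
    for algorithm, per_task in scores.items():
        task_means = [mean(seed_returns) for seed_returns in per_task.values()]
        summary[algorithm] = mean(task_means)
    return summary
```

Averaging within each task first keeps a task with many seeds from dominating the overall number, which matters when algorithms are run with different seed counts.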

Who benefits

Robotics, Autonomous Systems, Logistics, Manufacturing, Healthcare (e.g., treatment planning)

Key takeaways

  • Reversal Q-learning (RQL) is a new off-policy RL algorithm for training flow policies.
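  • It uses virtual on-policy trajectories, generated by "reversing" flows, and bias-and-variance reduction.
  • RQL avoids backpropagation through time and makes more effective use of the learned value function.
  • It achieves better average offline RL performance than other state-of-the-art flow-based algorithms across 50 simulated robotic tasks.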

Original post by Aditya Oberai, Seohong Park, Sergey Levine

"arXiv:2606.17551v1 Announce Type: new Abstract: Iterative generative modeling techniques, such as flow matching, provide powerful tools to model complex behaviors for effective offline reinforcement learning (RL). In this work, we propose a new off-policy RL algorithm that trains…"

