QPILOTS Enhances Flow Policies with Test-Time Q-Steering

Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart · June 16, 2026

Summary

QPILOTS is a method that improves flow-matching and diffusion policies in reinforcement learning by steering the denoising process at inference time. At each denoising step it projects the intermediate state to an estimate of the final clean action and computes the critic gradient there, which yields strong results on offline-to-online RL benchmarks and manipulation tasks.

QPILOTS targets flow-matching and diffusion policies in reinforcement learning (RL). These policies are expressive action generators, but optimizing them with temporal-difference RL is difficult because backpropagating critic gradients through multi-step denoising is numerically unstable. Existing solutions compromise by discarding gradient information or by requiring repeated policy fine-tuning. QPILOTS instead leaves the original policy untouched and steers the denoising process at inference time. At each denoising step, it projects the noisy intermediate action to an estimate of the final clean action, where critic predictions are more reliable, and computes the critic gradient there. The method, with variants such as QPILOTS-U and QPILOTS-M, reaches an average success rate of 90% across 50 tasks in a standard offline-to-online RL benchmark. It also steers large, frozen Vision-Language-Action (VLA) foundation models, matching or surpassing other inference-time methods on a range of manipulation tasks.
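The per-step idea described above (project the noisy action to a clean-action estimate, evaluate the critic gradient there, nudge the denoising step) can be illustrated with a toy one-dimensional sketch. This is not the paper's implementation: the velocity field `velocity`, the quadratic critic `q_value`, the `guidance` scale, and the additive update rule are all hypothetical stand-ins chosen for illustration, and the gradient is taken directly with respect to the projected action, ignoring the Jacobian of the projection.

```python
# Toy 1-D sketch of test-time Q-steering for a flow policy.
# Everything here is an illustrative assumption, not the QPILOTS algorithm itself.

def velocity(a, t, mu=0.0):
    """Hypothetical frozen policy: a velocity field that pulls actions toward mu."""
    return mu - a


def q_value(a, a_star=1.0):
    """Hypothetical critic: highest value at the action a_star."""
    return -(a - a_star) ** 2


def q_grad(a, a_star=1.0):
    """Analytic gradient of q_value with respect to the action."""
    return -2.0 * (a - a_star)


def sample_action(a0, n_steps=10, guidance=0.0):
    """Euler-integrate the flow from t=0 (noise) to t=1 (clean action).

    With guidance == 0 this is plain sampling from the frozen policy.
    With guidance > 0 each step is nudged by the critic gradient evaluated
    at the projected clean action rather than at the noisy action.
    """
    a = a0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = velocity(a, t)                 # frozen policy, never updated
        a_clean = a + (1.0 - t) * v        # one-step projection to t = 1
        g = q_grad(a_clean)                # critic gradient at the estimate
        a = a + dt * (v + guidance * g)    # steered Euler step (assumed form)
    return a


if __name__ == "__main__":
    plain = sample_action(-0.5, guidance=0.0)
    steered = sample_action(-0.5, guidance=1.0)
    print("unsteered action:", plain, "Q =", q_value(plain))
    print("steered action:  ", steered, "Q =", q_value(steered))
```

In this toy setting the steered sample ends closer to the critic's preferred action than the unsteered one, while `velocity` itself is never modified, which mirrors the "frozen policy, steered sampler" split the summary describes.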

Why it matters

For AI engineers and researchers working on complex robotic control, generative models, or reinforcement learning, QPILOTS offers a more stable and efficient way to optimize and deploy advanced action generation policies. It enables better performance without the computational overhead of repeated training.

How to implement this in your domain

  1. Investigate QPILOTS for improving existing flow-matching or diffusion policies in RL applications.
  2. Integrate QPILOTS into robotic control systems to enhance action generation and task success rates.
  3. Apply QPILOTS to steer large, pre-trained foundation models for specific manipulation or control tasks.
  4. Benchmark QPILOTS against current policy optimization methods in your specific domain to assess performance gains.

Who benefits

Robotics, Autonomous Systems, AI/ML Development, Manufacturing, Gaming

Key takeaways

  • QPILOTS improves flow-matching and diffusion policies by steering the denoising process at inference time.
  • It addresses numerical instability in RL optimization without modifying the original policy.
  • The method reaches an average success rate of 90% across 50 tasks in a standard offline-to-online RL benchmark.
  • QPILOTS can effectively steer large, frozen foundation models for complex tasks.

Original post by Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

"arXiv:2606.14801v1 Announce Type: new Abstract: Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action g…"

