NormGuard Preserves Image Quality in RL-Tuned Flow-Based Generators.

Tianlin Pan, Lianyu Pang, Cheng Da, Huan Yang, Changqian Yu, Kun Gai, Wenhan Luo · June 29, 2026

Summary

This paper introduces NormGuard, a hinge penalty that prevents velocity norm inflation during reinforcement learning (RL) post-training of flow-based generative models. It consistently improves MLLM-judged image quality and forensic realism while preserving reward, addressing a common issue where RL fine-tuning degrades perceptual quality.

Reinforcement learning (RL) is often used to fine-tune flow-based generative models so that their outputs align better with a specific reward. This post-training frequently degrades the perceptual quality of the generated outputs in ways the reward proxy does not capture. The paper identifies a structural signature of this quality drift: RL fine-tuning tends to inflate the model's per-step velocity norm.

Inference-time corrections, such as rescaling the velocity norm, have been explored in other contexts but do not transfer to the RL setting. The inflation becomes co-adapted into the model's weights, so rescaling at inference time neither improves the reward nor repairs the quality degradation. The authors' analysis also shows that rescaling the velocity magnitude carries no consistent first-order reward signal at the batch level, which suggests that suppressing norm inflation does not inherently remove a reward-carrying component.

These observations motivate NormGuard, a training-time intervention. NormGuard is a hinge penalty that activates only when the model's velocity norm exceeds a reference norm, and it can be added to any velocity-local base loss. Experiments across several base models, post-training methods, and reward proxies show that NormGuard consistently improves MLLM-judged image quality and forensic realism while preserving the intended reward. The benefits are amplified under few-step inference, and the improvements are not simply an effect of early stopping.
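
To make the hinge idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not give the exact formula, so the unsquared form max(0, ||v_model|| - ||v_ref||), the per-sample L2 norm, the batch mean, and the names `hinge_norm_penalty`, `guarded_loss`, and `weight` are all assumptions made for illustration. A real training loop would use tensor operations rather than lists.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def hinge_norm_penalty(v_model, v_ref):
    """Zero unless the model's velocity norm exceeds the reference norm.

    Assumed form: max(0, ||v_model|| - ||v_ref||). The paper may use a
    squared or otherwise normalized variant.
    """
    return max(0.0, l2_norm(v_model) - l2_norm(v_ref))


def guarded_loss(base_loss, v_model_batch, v_ref_batch, weight=1.0):
    """Base loss plus the batch-mean hinge penalty.

    `weight` is a hypothetical penalty coefficient, not a value from the paper.
    """
    penalties = [
        hinge_norm_penalty(v_m, v_r)
        for v_m, v_r in zip(v_model_batch, v_ref_batch)
    ]
    return base_loss + weight * sum(penalties) / len(penalties)


if __name__ == "__main__":
    v_ref = [3.0, 4.0]        # reference velocity, norm 5
    v_inflated = [6.0, 8.0]   # inflated velocity, norm 10
    v_shrunk = [0.3, 0.4]     # smaller norm: the hinge stays inactive
    print(hinge_norm_penalty(v_inflated, v_ref))
    print(hinge_norm_penalty(v_shrunk, v_ref))
    print(guarded_loss(1.0, [v_inflated, v_ref], [v_ref, v_ref], weight=0.5))
```

The penalty is one-sided: velocities at or below the reference norm contribute nothing, so only inflation is discouraged and the base loss is otherwise left untouched.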

Why it matters

Professionals developing or deploying generative AI models, especially for image or video synthesis, can use NormGuard to maintain high perceptual quality while still benefiting from RL-based reward alignment.

How to implement this in your domain

  1. Identify instances where RL post-training of generative models leads to perceptual quality degradation.
  2. Integrate NormGuard's hinge penalty into the training loss function of flow-based generative models.
  3. Establish a reference velocity norm for the base model to guide the NormGuard penalty.
  4. Evaluate the impact of NormGuard on both reward alignment and perceptual quality metrics (e.g., MLLM-judged scores).
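
As a rough illustration of steps 1 and 3, the sketch below compares mean velocity norms of an RL-tuned model against a reference model at each sampling step. It is a diagnostic under stated assumptions, not a procedure from the paper: the function names, the nested-list data layout (step, then sample, then velocity components), and the `tolerance` threshold of 1.05 are all hypothetical.

```python
import math


def mean_norm(velocities):
    """Mean Euclidean norm over a list of velocity vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in velocities]
    return sum(norms) / len(norms)


def norm_inflation_by_step(tuned_by_step, ref_by_step):
    """Ratio of mean velocity norm (tuned / reference) per sampling step.

    Both arguments are lists indexed by sampling step; each entry holds the
    velocity vectors collected at that step for the same inputs.
    """
    return [
        mean_norm(tuned) / mean_norm(ref)
        for tuned, ref in zip(tuned_by_step, ref_by_step)
    ]


def flag_inflated_steps(ratios, tolerance=1.05):
    """Indices of steps whose ratio exceeds the (hypothetical) tolerance."""
    return [i for i, ratio in enumerate(ratios) if ratio > tolerance]


if __name__ == "__main__":
    # Two sampling steps, one sample each; the second step is inflated.
    ref = [[[3.0, 4.0]], [[3.0, 4.0]]]
    tuned = [[[3.0, 4.0]], [[6.0, 8.0]]]
    ratios = norm_inflation_by_step(tuned, ref)
    print(ratios)
    print(flag_inflated_steps(ratios))
```

Ratios persistently above 1 on matched inputs would be consistent with the norm inflation the paper describes, and the reference-side norms are the quantity a hinge penalty would compare against.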

Who benefits

Creative Arts, Entertainment, Gaming, AI/Tech, Digital Media

Key takeaways

  • RL post-training of flow-based generators can degrade perceptual quality due to velocity norm inflation.
  • Inference-time corrections are ineffective as inflation is co-adapted into model weights.
  • NormGuard is a training-time hinge penalty that prevents norm inflation.
  • It improves image quality and realism while preserving reward, especially with few-step inference.

Original post by Tianlin Pan, Lianyu Pang, Cheng Da, Huan Yang, Changqian Yu, Kun Gai, Wenhan Luo

"arXiv:2606.27771v1 Announce Type: new Abstract: Reinforcement learning (RL) post-training improves the reward alignment of flow-based generators, but often degrades perceptual quality in ways that are not captured by the reward proxy. We identify a simple structural signature of…"

