TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
Abstract
A novel reinforcement learning approach, TDM-R1, is introduced to enhance few-step generative models by incorporating non-differentiable rewards through decoupled surrogate reward learning and generator learning.
While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely strongly on back-propagating through differentiable reward models, thereby excluding many important real-world reward signals, such as non-differentiable rewards like binary human preferences and object counts. To properly incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability to learn from generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
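To make the decoupling concrete, here is a minimal sketch of what one iteration of such a loop could look like, assuming a generic black-box scorer. `generator`, `surrogate`, `non_diff_reward`, and both optimizers are hypothetical stand-ins rather than the paper's actual interfaces, and the per-step trajectory rewards described in the abstract are omitted for brevity.

```python
# Hypothetical sketch of a decoupled RL step with a non-differentiable reward.
# All names and interfaces here are illustrative, not the paper's implementation.
import torch
import torch.nn.functional as F

def train_step(generator, surrogate, non_diff_reward, noise,
               opt_g, opt_r, num_steps=4):
    # --- Phase 1: surrogate reward learning ------------------------------
    # Sample images without tracking gradients, query the black-box reward
    # (e.g., an OCR-based text-rendering score or a binary human preference,
    # assumed here to return a tensor of per-image scalars), then regress
    # the differentiable surrogate onto those scores.
    with torch.no_grad():
        images = generator(noise, num_steps=num_steps)
    rewards = non_diff_reward(images)   # non-differentiable, values only
    pred = surrogate(images)            # differentiable predicted reward
    reward_loss = F.mse_loss(pred, rewards)
    opt_r.zero_grad()
    reward_loss.backward()
    opt_r.step()

    # --- Phase 2: generator learning --------------------------------------
    # Regenerate with gradients enabled and maximize the frozen surrogate's
    # predicted reward, back-propagating through it into the generator.
    # (The paper also exploits per-step rewards along TDM's deterministic
    # trajectory; this sketch scores only the final image.)
    surrogate.requires_grad_(False)
    images = generator(noise, num_steps=num_steps)
    gen_loss = -surrogate(images).mean()
    opt_g.zero_grad()
    gen_loss.backward()
    opt_g.step()
    surrogate.requires_grad_(True)
    return reward_loss.item(), gen_loss.item()
```

The design point this illustrates is that the non-differentiable scorer is only ever queried for values, while the generator's gradients come entirely from the learned surrogate.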
Community
Although few-step diffusion models have achieved efficient generation, they still fall short on tasks such as complex instruction following and text rendering. Existing reinforcement learning methods rely heavily on differentiable reward signals, making it difficult to leverage key non-differentiable feedback such as human preferences and object counting. This work proposes TDM-R1, which for the first time enables large-scale reinforcement learning on few-step diffusion models using non-differentiable rewards. With only 4 sampling steps, it boosts the GenEval score from 61% to 92%, substantially surpassing both the 40-step base model (63%) and GPT-4o (84%).
The following papers were recommended by the Semantic Scholar API:
- Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics (2026)
- Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages (2026)
- DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment (2026)
- Reward-Forcing: Autoregressive Video Generation with Reward Feedback (2026)
- Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution (2026)
- EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation (2026)
- Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser (2026)