TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward
Abstract
A novel reinforcement learning approach, TDM-R1, is introduced to enhance few-step generative models by incorporating non-differentiable rewards through decoupled surrogate reward learning and generator learning.
While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely strongly on back-propagating through differentiable reward models, thereby excluding many important real-world reward signals, such as non-differentiable rewards like binary human preferences and object counts. To properly incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability to learn from generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
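To make the decoupling concrete, here is a minimal sketch of what one iteration of such a loop could look like, assuming a generic black-box scorer. `generator`, `surrogate`, `non_diff_reward`, and both optimizers are hypothetical stand-ins rather than the paper's actual interfaces, and the per-step trajectory rewards described in the abstract are omitted for brevity.

```python
# Hypothetical sketch of a decoupled RL step with a non-differentiable reward.
# All names and interfaces here are illustrative, not the paper's implementation.
import torch
import torch.nn.functional as F

def train_step(generator, surrogate, non_diff_reward, noise,
               opt_g, opt_r, num_steps=4):
    # --- Phase 1: surrogate reward learning ------------------------------
    # Sample images without tracking gradients, query the black-box reward
    # (e.g., an OCR-based text-rendering score or a binary human preference,
    # assumed here to return a tensor of per-image scalars), then regress
    # the differentiable surrogate onto those scores.
    with torch.no_grad():
        images = generator(noise, num_steps=num_steps)
    rewards = non_diff_reward(images)   # non-differentiable, values only
    pred = surrogate(images)            # differentiable predicted reward
    reward_loss = F.mse_loss(pred, rewards)
    opt_r.zero_grad()
    reward_loss.backward()
    opt_r.step()

    # --- Phase 2: generator learning --------------------------------------
    # Regenerate with gradients enabled and maximize the frozen surrogate's
    # predicted reward, back-propagating through it into the generator.
    # (The paper also exploits per-step rewards along TDM's deterministic
    # trajectory; this sketch scores only the final image.)
    surrogate.requires_grad_(False)
    images = generator(noise, num_steps=num_steps)
    gen_loss = -surrogate(images).mean()
    opt_g.zero_grad()
    gen_loss.backward()
    opt_g.step()
    surrogate.requires_grad_(True)
    return reward_loss.item(), gen_loss.item()
```

The design point this illustrates is that the non-differentiable scorer is only ever queried for values, while the generator's gradients come entirely from the learned surrogate.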
Community
Although few-step diffusion models have achieved efficient generation, they still fall short on tasks such as complex instruction following and text rendering. Existing reinforcement learning methods rely heavily on differentiable reward signals, making it difficult to leverage key non-differentiable feedback such as human preferences and object counting. This work proposes TDM-R1, which for the first time enables large-scale reinforcement learning on few-step diffusion models using non-differentiable rewards. With only 4 sampling steps, it boosts the GenEval score from 61% to 92%, substantially surpassing both the 40-step base model (63%) and GPT-4o (84%).
The following papers were recommended by the Semantic Scholar API:
- Euphonium: Steering Video Flow Matching via Process Reward Gradient Guided Stochastic Dynamics (2026)
- Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages (2026)
- DenseGRPO: From Sparse to Dense Reward for Flow Matching Model Alignment (2026)
- Reward-Forcing: Autoregressive Video Generation with Reward Feedback (2026)
- Bidirectional Reward-Guided Diffusion for Real-World Image Super-Resolution (2026)
- EasyTune: Efficient Step-Aware Fine-Tuning for Diffusion-Based Motion Generation (2026)
- Diffusion Alignment Beyond KL: Variance Minimisation as Effective Policy Optimiser (2026)