arxiv:2603.07700

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Published on Mar 8 · Submitted by Yihong Luo on Mar 10
Abstract

TDM-R1 is a novel reinforcement learning approach that enhances few-step generative models with non-differentiable rewards by decoupling surrogate reward learning from generator learning.

AI-generated summary

While few-step generative models have enabled powerful image and video generation at significantly lower cost, a generic reinforcement learning (RL) paradigm for few-step models remains an open problem. Existing RL approaches for few-step diffusion models rely heavily on back-propagating through differentiable reward models, thereby excluding most important real-world reward signals, e.g., non-differentiable rewards such as binary human feedback and object counts. To properly incorporate non-differentiable rewards into few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we develop practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' performance under generic rewards. We conduct extensive experiments spanning text rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performance on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
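The decoupling described in the abstract — first fitting a differentiable surrogate to a non-differentiable reward, then updating the generator through the surrogate — can be sketched in miniature. This is a toy illustration of the decoupling idea only, not the paper's algorithm: the 8-dimensional "generator", the thresholded binary `true_reward`, and the logistic surrogate are all invented for illustration, and TDM-R1's per-step trajectory rewards are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

def true_reward(x):
    # Non-differentiable black-box reward: a binary "like" signal.
    return float(x.sum() > 4.0)

# --- Stage 1: surrogate reward learning -----------------------------
# Fit a differentiable logistic surrogate on (sample, reward) pairs.
w, b = np.zeros(DIM), 0.0
X = rng.normal(size=(2000, DIM))
y = np.array([true_reward(x) for x in X])
lr = 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # surrogate prediction
    g = p - y                               # logistic-loss gradient
    w -= lr * (X.T @ g) / len(X)
    b -= lr * g.mean()

# --- Stage 2: generator learning ------------------------------------
# Toy "generator": samples x = theta + noise; ascend the surrogate,
# never the non-differentiable reward itself.
theta = np.zeros(DIM)
for _ in range(200):
    x = theta + rng.normal(size=(64, DIM))
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    # Gradient of the surrogate w.r.t. theta, averaged over the batch.
    grad = ((p * (1 - p))[:, None] * w).mean(axis=0)
    theta += 0.5 * grad

# True (non-differentiable) reward rate before vs. after training.
before = np.mean([true_reward(rng.normal(size=DIM)) for _ in range(500)])
after = np.mean([true_reward(theta + rng.normal(size=DIM)) for _ in range(500)])
```

The key property the sketch shows: the generator's true reward rate improves even though gradients only ever flow through the learned surrogate.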

Community

Paper author · Paper submitter (edited 1 day ago)

Although few-step diffusion models have achieved efficient generation, they still fall short on tasks such as complex instruction following and text rendering. Existing reinforcement learning methods rely heavily on differentiable reward signals, making it difficult to leverage key non-differentiable feedback such as human preferences and object counting. This work proposes TDM-R1, which for the first time enables large-scale reinforcement learning on few-step diffusion models using non-differentiable rewards. With only 4 sampling steps, it boosts the GenEval score from 61% to 92%, substantially surpassing both the 40-step base model (63%) and GPT-4o (84%).


Models citing this paper: 1
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0