Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis
¹City University of Hong Kong ²The Hong Kong Polytechnic University ³OPPO Research Institute
No perceptual backbone, no discriminator, no auxiliary networks, and no additional ground-truth images.
⚙️ Installation
We recommend using Conda to manage dependencies. Follow these steps to set up the training environment:
conda create -n dpdmd python=3.10.16
conda activate dpdmd
pip install -e .
During training, your model will be evaluated using DINOv2, CLIP, ImageReward, and PickScore metrics, all of which are available in the installed dpdmd environment above.
📌 Attention 1 [new env name: test_div]: DINOv3 requires transformers >= 4.57.0, which is incompatible with the ImageReward metric. Therefore, it is recommended to use DINOv2 during training. If you need to evaluate with DINOv3 after training, please create a separate conda environment and upgrade the transformers version accordingly.
📌 Attention 2 [new env name: vq]: For visual quality evaluation, please follow VisualQuality-R1. After setup, install timm via pip install timm to enable the MANIQA metric. Creating a new environment for this step is simple and recommended.
📌 Overall, three separate environments may be required: one for training and human preference evaluation (ImageReward, PickScore, DINOv2, and CLIP), one for visual quality evaluation (VisualQuality-R1 and MANIQA), and one for diversity evaluation (DINOv3 and CLIP). If DINOv3 is not used, only two environments are needed: one for training (including human preference evaluation) and one for visual quality evaluation.
⚡ Quick Inference
Run the following code to generate an image (the Hugging Face model is a trained SD3.5-Medium transformer).
import torch
from diffusers import StableDiffusion3Pipeline

base_sd35_weight_path = "stable-diffusion-3.5-medium"  # SD3.5-Medium weight path
transformer_weight_path = "DPDMD-SD35M-4NFE-natural.pt"  # distilled transformer weight path

pipe = StableDiffusion3Pipeline.from_pretrained(base_sd35_weight_path, torch_dtype=torch.bfloat16)

# Load the distilled transformer weights into the pipeline.
state_dict = torch.load(transformer_weight_path, map_location="cpu")
missing, unexpected = pipe.transformer.load_state_dict(state_dict, strict=True)
pipe = pipe.to("cuda:0")

# Fix the seed for reproducible sampling.
g_init = torch.Generator(device="cuda:0").manual_seed(5)
image = pipe(
    "a dog",
    num_inference_steps=4,  # 4 NFEs, matching the distilled model
    guidance_scale=1.0,
    height=1024,
    width=1024,
    generator=g_init,
).images[0]

save_path = "./demo.png"
image.save(save_path)
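The seeded torch.Generator above is what makes sampling reproducible: the same seed yields the same latent noise and hence the same image. A minimal sketch of this behavior with plain tensors (assuming only that torch is installed; the function below is illustrative, not part of the repo):

```python
import torch

def sample_noise(seed: int, shape=(4, 128, 128)) -> torch.Tensor:
    """Draw the same latent noise every time for a fixed seed."""
    g = torch.Generator(device="cpu").manual_seed(seed)
    return torch.randn(shape, generator=g)

# Identical seeds give bit-identical latents, so the 4-step sampler
# produces the same image; a different seed gives a different sample.
a = sample_noise(5)
b = sample_noise(5)
c = sample_noise(6)
assert torch.equal(a, b)
assert not torch.equal(a, c)
```

To generate several distinct samples from one prompt, pass a generator with a different seed on each call.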
🚀 Training
Starting the training process is straightforward. Please follow the three steps below.
Data Preparation
We only use text prompts for training. Example prompts can be found in the data/ folder (one text prompt per line). All prompts are stored in .txt format.
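A minimal loader for such a prompt file is sketched below (the repo's actual data-loading code may differ); it simply reads one prompt per line and skips blanks:

```python
import os
import tempfile

def load_prompts(path: str) -> list[str]:
    """Read one text prompt per line from a .txt file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Example: a file with two prompts and a blank line in between.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("a dog\n\na cat playing piano\n")
tmp.close()
prompts = load_prompts(tmp.name)
os.unlink(tmp.name)
print(prompts)  # ['a dog', 'a cat playing piano']
```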
Pretrained Model Preparation
Before starting training, you should first download the required files:
- [SD3.5 Medium] stable-diffusion-3.5-medium
- [PickScore processor] CLIP-ViT-H-14-laion2B-s32B-b79K
- [PickScore] PickScore_v1
- [ImageReward] ImageReward
- [DINOv2] dinov2-base
- [CLIP] clip-vit-large-patch14
Then modify the weight paths in the training script, located at scripts/run_train_sd35.sh:
--teacher_id weights/stabilityai/stable-diffusion-3.5-medium \
--student_id weights/stabilityai/stable-diffusion-3.5-medium \
--fake_id weights/stabilityai/stable-diffusion-3.5-medium \
--pick_processor_path weights/CLIP-ViT-H-14-laion2B-s32B-b79K \
--pick_model_path weights/PickScore_v1 \
--ir_model_path weights/ImageReward/ImageReward.pt \
--ir_med_config weights/ImageReward/med_config.json \
--dino_path weights/dinov2-base \
--clip_path weights/clip-vit-large-patch14 \
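These flags are presumably consumed by an argparse parser inside the training entry point. A minimal sketch of how the path flags above could be parsed (flag names are taken from the script; everything else here is hypothetical):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Only the weight-path flags shown in scripts/run_train_sd35.sh.
    p = argparse.ArgumentParser(description="DPDMD training (path flags only)")
    for flag in [
        "--teacher_id", "--student_id", "--fake_id",
        "--pick_processor_path", "--pick_model_path",
        "--ir_model_path", "--ir_med_config",
        "--dino_path", "--clip_path",
    ]:
        p.add_argument(flag, type=str, required=True)
    return p

args = build_parser().parse_args([
    "--teacher_id", "weights/stabilityai/stable-diffusion-3.5-medium",
    "--student_id", "weights/stabilityai/stable-diffusion-3.5-medium",
    "--fake_id", "weights/stabilityai/stable-diffusion-3.5-medium",
    "--pick_processor_path", "weights/CLIP-ViT-H-14-laion2B-s32B-b79K",
    "--pick_model_path", "weights/PickScore_v1",
    "--ir_model_path", "weights/ImageReward/ImageReward.pt",
    "--ir_med_config", "weights/ImageReward/med_config.json",
    "--dino_path", "weights/dinov2-base",
    "--clip_path", "weights/clip-vit-large-patch14",
])
print(args.dino_path)  # weights/dinov2-base
```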
Start Training
📌 Attention: When starting a training experiment, keep the experiment name sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5 (example) consistent across the following arguments so that all generated files are stored under the same root folder.
--log_path outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/log \
--ckpt_dir outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/ckpts \
--eval_dir outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/eval_images \
--process_folder_name outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/process_vis \
--diversity_folder_name outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/div_vis \
- log_path: stores training log information.
- ckpt_dir: stores checkpoint weights.
- eval_dir: stores generated images used for human preference evaluation during training (overwritten at each evaluation step).
- process_folder_name: stores student model output images during training (overwritten at each iteration).
- diversity_folder_name: stores images used for diversity evaluation during training (overwritten at each evaluation step).
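One way to guarantee that all five arguments share the same root is to derive them from a single experiment name. The helper below is illustrative only, not part of the repo:

```python
def output_paths(exp_name: str, root: str = "outputs") -> dict[str, str]:
    """Derive all training output folders from one experiment name."""
    base = f"{root}/{exp_name}"
    return {
        "log_path": f"{base}/log",
        "ckpt_dir": f"{base}/ckpts",
        "eval_dir": f"{base}/eval_images",
        "process_folder_name": f"{base}/process_vis",
        "diversity_folder_name": f"{base}/div_vis",
    }

paths = output_paths("sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5")
print(paths["ckpt_dir"])
# outputs/sd35_dpdmd/sd35m_t30_1024_lr1e5_4nfe_anchor5/ckpts
```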
After completing all the preparations, run the following command to start training.
bash scripts/run_train_sd35.sh
🛠️ Testing
We provide the testing files for diversity evaluation (test_diversity.py), human preference evaluation (test_preference.py), and visual quality evaluation (test_quality.py). Please ensure that the required environments for each evaluation are installed beforehand.
Instructions for modifying paths or loading model weights are included within each file.
- Human Preference:
accelerate launch --main_process_port 29512 test_preference.py
- Visual Quality:
python test_quality.py
  - VisualQuality-R1 weight
  - MANIQA weight
- Diversity:
CUDA_VISIBLE_DEVICES=0 accelerate launch --main_process_port 29519 --num_processes 1 test_diversity.py
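The exact diversity protocol lives in test_diversity.py. As a rough illustration (not the repo's implementation), diversity over a set of images generated from one prompt can be scored as the average pairwise cosine distance between their backbone features (e.g. DINO embeddings); here random vectors stand in for real features:

```python
import numpy as np

def diversity_score(features: np.ndarray) -> float:
    """Average pairwise cosine distance; higher = more diverse.

    features: (N, D) array, one embedding per generated image.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T  # (N, N) cosine similarities
    n = len(f)
    off_diag = sim.sum() - np.trace(sim)  # sum over the N*(N-1) ordered pairs
    return float(1.0 - off_diag / (n * (n - 1)))

rng = np.random.default_rng(0)
distinct = rng.normal(size=(8, 64))                     # unrelated embeddings
collapsed = np.tile(rng.normal(size=(1, 64)), (8, 1))   # mode collapse: all identical
assert diversity_score(distinct) > diversity_score(collapsed)
assert abs(diversity_score(collapsed)) < 1e-6           # identical images score ~0
```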
💪 Acknowledgement
I would like to sincerely thank Gongye Liu, Ke Lei (Tsinghua University), and Zhuoyan Luo for their generous support of this project and their invaluable guidance in the field of generative modeling.
📧 Contact
If you have any questions, please email tianhewu-c@my.cityu.edu.hk.
📚 BibTeX
@article{wu2026diversity,
  title={Diversity-Preserved Distribution Matching Distillation for Fast Visual Synthesis},
  author={Wu, Tianhe and Li, Ruibin and Zhang, Lei and Ma, Kede},
  journal={arXiv preprint arXiv:2602.03139},
  year={2026}
}