SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu 📧
Contact: xihuiliu@hku.hk
Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language understanding and generation within a single framework. However, a significant gap remains: a model's strong visual understanding often fails to transfer to its visual generation. A model may correctly understand an image according to user instructions, yet be unable to generate a faithful image from the corresponding text prompt. This phenomenon raises a compelling question: can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop in which the model's own understanding module acts as an internal evaluator, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To make this feedback comprehensive, we design a global-local dual reward system that offers multi-scale guidance for the inherent structural complexity of images: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM yields powerful capabilities and strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.
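As a rough illustration of the global-local dual reward described above (a minimal sketch; the function name and weighting scheme are assumptions, not the released implementation), the two signals could be blended into a single fine-tuning reward like this:

```python
# Hypothetical sketch of the global-local dual reward: the understanding
# module scores its own generations at two scales, and the two signals are
# blended into one reward. Names and the weighting scheme are assumptions.

def dual_reward(global_score: float, local_scores: list[float],
                alpha: float = 0.5) -> float:
    """Blend a global semantics/layout score with object-level scores.

    global_score : understanding module's score for overall prompt fidelity.
    local_scores : per-object fidelity scores from the same module.
    alpha        : global-vs-local weight (assumed, not from the paper).
    """
    local = sum(local_scores) / len(local_scores) if local_scores else 0.0
    return alpha * global_score + (1.0 - alpha) * local
```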
We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, bridging the performance gaps caused by conflicts during the earlier training phases. SRUM demonstrates exceptional generalization across both common compositions and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.
📢 News
We sincerely thank all contributors from the open community for their valuable support.
- Nov. 15, 2025: We released the official website, model, and report for SRUM. Please upvote our Hugging Face daily paper and try the demo.
📮 Notice
Following Bagel's original settings, pay attention to the inference hyperparameters below (a sketch collecting their typical values into a config follows this notice):

- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but applies only to the text condition (good for editing, may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.
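For reference, the typical values above can be collected into one config. This is an illustrative sketch only; verify the exact argument names against the Bagel/SRUM inference code:

```python
# Illustrative inference settings assembled from the typical values above.
inference_hyper = dict(
    cfg_text_scale=4.0,        # text guidance; 1.0 disables, typical 4.0-8.0
    cfg_image_scale=1.5,       # input-image detail preservation; typical 1.0-2.0
    cfg_interval=[0.4, 1.0],   # fraction of denoising steps with CFG applied
    timestep_shift=3.0,        # higher: more early steps (layout); lower: more late steps (details)
    num_timesteps=50,          # total denoising steps
    cfg_renorm_min=0.0,        # 1.0 disables CFG-Renorm
    cfg_renorm_type="global",  # "global" | "channel" | "text_channel"
)
```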
🔥 Quick Start
1️⃣ Set up environment
```shell
git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt
```
If flash-attention is difficult to install via pip, use the prebuilt wheel instead:
```shell
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
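A quick import check confirms the wheel matches your environment (assuming the package exposes `__version__`, as recent releases do):

```python
# Sanity check: the import fails if the wheel's CUDA/PyTorch ABI mismatches.
import flash_attn
print(flash_attn.__version__)  # expect 2.7.0.post2 for the wheel above
```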
Alternatively, you can follow Bagel's environment setup.
2️⃣ Download the Bagel pretrained model or our SRUM checkpoint
```python
# Download the Bagel pretrained checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
```python
# Download our SRUM checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
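After either download finishes, a small sanity check (not part of the official scripts) can confirm the weight files landed locally:

```python
# List downloaded weight files to confirm the snapshot completed.
from pathlib import Path

save_dir = Path("models/SRUM_BAGEL_7B_MoT")  # or "models/BAGEL-7B-MoT"
weights = sorted(save_dir.glob("*.safetensors")) + sorted(save_dir.glob("*.bin"))
print(f"{len(weights)} weight file(s) in {save_dir}:")
for w in weights:
    print(f"  {w.name}  ({w.stat().st_size / 1e9:.2f} GB)")
```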
📊 Benchmarks
1. Composition (T2I-CompBench)
| T2I Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |
| Unified Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | 88.61 | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | 90.88 | 85.84 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78↑ | 90.22↑ | 86.57↑ | 85.10↑ | 74.52↑ | 85.44↑ | 93.88↑ | 86.52↓ | 85.75↑ |
| Bagel+SRUM | 83.10↑ | 92.90↑ | 88.69↑ | 88.47↑ | 78.52↑ | 84.23↑ | 86.92↑ | 89.57↑ | 86.55↑ |
| Bagel+SRUM (CoT) 🏆 | 88.60↑ | 92.90↑ | 91.31↑ | 90.48↑ | 80.12↑ | 84.47↑ | 89.93↑ | 89.15↑ | 88.37↑ |
2. Reasoning-Informed (T2I-ReasonBench)
| Model | Entity | Idiom | Scientific | Textual Image | Average |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | **52.85** | **40.51** | **47.83** | **45.83** | **46.75** |
Performance comparison of Bagel models across four categories and their average scores. Bold values indicate the best performance in each column.
✍️ Citation
```bibtex
@article{jin2025srum,
  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal = {arXiv preprint arXiv:2510.12784},
  year    = {2025}
}
```
📜 License
SRUM is licensed under the Apache License 2.0.