SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu 📧
Contact: xihuiliu@hku.hk
Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language understanding and generation within a single framework. However, a significant gap remains: a model's strong visual understanding often fails to transfer to its visual generation. A model may correctly understand an image according to user instructions, yet be unable to generate a faithful image from the corresponding text prompt. This phenomenon raises a compelling question: can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop in which the model's own understanding module acts as an internal evaluator, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To make this feedback comprehensive, we design a global-local dual reward system that offers multi-scale guidance for the inherent structural complexity of images: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM yields powerful capabilities and strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.
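As a rough illustration of the global-local dual reward described above (a minimal sketch; the function name and weighting scheme are assumptions, not the released implementation), the two signals could be blended into a single fine-tuning reward like this:

```python
# Hypothetical sketch of the global-local dual reward: the understanding
# module scores its own generations at two scales, and the two signals are
# blended into one reward. Names and the weighting scheme are assumptions.

def dual_reward(global_score: float, local_scores: list[float],
                alpha: float = 0.5) -> float:
    """Blend a global semantics/layout score with object-level scores.

    global_score : understanding module's score for overall prompt fidelity.
    local_scores : per-object fidelity scores from the same module.
    alpha        : global-vs-local weight (assumed, not from the paper).
    """
    local = sum(local_scores) / len(local_scores) if local_scores else 0.0
    return alpha * global_score + (1.0 - alpha) * local
```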
We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, bridging the performance gaps caused by conflicts during the earlier training phases. SRUM demonstrates exceptional generalization across both common compositions and world knowledge. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.
📢 News
We sincerely thank all contributors from the open community for their valuable support.
- Nov. 15, 2025: We released the official website, model, and report for SRUM. Please upvote our Hugging Face daily paper and try the demo.
📮 Notice
Following Bagel's original settings, pay attention to the inference hyperparameters below (a sketch collecting their typical values into a config follows this notice):

- `cfg_text_scale`: Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.
- `cfg_image_scale`: Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.
- `cfg_interval`: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.
- `timestep_shift`: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
- `num_timesteps`: Total denoising steps. Typical: `50`.
- `cfg_renorm_min`: Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.
- `cfg_renorm_type`: CFG-Renorm method:
  - `global`: Normalize over all tokens and channels (default for T2I).
  - `channel`: Normalize across channels for each token.
  - `text_channel`: Like `channel`, but applies only to the text condition (good for editing, may cause blur).
- If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min`, or decrease `cfg_scale`.
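For reference, the typical values above can be collected into one config. This is an illustrative sketch only; verify the exact argument names against the Bagel/SRUM inference code:

```python
# Illustrative inference settings assembled from the typical values above.
inference_hyper = dict(
    cfg_text_scale=4.0,        # text guidance; 1.0 disables, typical 4.0-8.0
    cfg_image_scale=1.5,       # input-image detail preservation; typical 1.0-2.0
    cfg_interval=[0.4, 1.0],   # fraction of denoising steps with CFG applied
    timestep_shift=3.0,        # higher: more early steps (layout); lower: more late steps (details)
    num_timesteps=50,          # total denoising steps
    cfg_renorm_min=0.0,        # 1.0 disables CFG-Renorm
    cfg_renorm_type="global",  # "global" | "channel" | "text_channel"
)
```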
🔥 Quick Start
1️⃣ Set up environment
```shell
git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt
```
If flash-attention is difficult to install via pip, use the prebuilt wheel instead:
```shell
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
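A quick import check confirms the wheel matches your environment (assuming the package exposes `__version__`, as recent releases do):

```python
# Sanity check: the import fails if the wheel's CUDA/PyTorch ABI mismatches.
import flash_attn
print(flash_attn.__version__)  # expect 2.7.0.post2 for the wheel above
```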
Alternatively, you can follow Bagel's environment setup.
2️⃣ Download the Bagel pretrained model or our SRUM checkpoint
```python
# Download the Bagel pretrained checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
```python
# Download our SRUM checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
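After either download finishes, a small sanity check (not part of the official scripts) can confirm the weight files landed locally:

```python
# List downloaded weight files to confirm the snapshot completed.
from pathlib import Path

save_dir = Path("models/SRUM_BAGEL_7B_MoT")  # or "models/BAGEL-7B-MoT"
weights = sorted(save_dir.glob("*.safetensors")) + sorted(save_dir.glob("*.bin"))
print(f"{len(weights)} weight file(s) in {save_dir}:")
for w in weights:
    print(f"  {w.name}  ({w.stat().st_size / 1e9:.2f} GB)")
```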
📊 Benchmarks
1. Composition (T2I-CompBench)
| T2I Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |
| Unified Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | 88.61 | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | 90.88 | 85.84 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78↑ | 90.22↑ | 86.57↑ | 85.10↑ | 74.52↑ | 85.44↑ | 93.88↑ | 86.52↓ | 85.75↑ |
| Bagel+SRUM | 83.10↑ | 92.90↑ | 88.69↑ | 88.47↑ | 78.52↑ | 84.23↑ | 86.92↑ | 89.57↑ | 86.55↑ |
| Bagel+SRUM (CoT) 🏆 | 88.60↑ | 92.90↑ | 91.31↑ | 90.48↑ | 80.12↑ | 84.47↑ | 89.93↑ | 89.15↑ | 88.37↑ |
2. Reasoning-Informed (T2I-ReasonBench)
| Model | Entity | Idiom | Scientific | Textual Image | Average |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | **52.85** | **40.51** | **47.83** | **45.83** | **46.75** |
Performance comparison of Bagel models across four categories and their average scores. Bold values indicate the best performance in each column.
✍️ Citation
```bibtex
@article{jin2025srum,
  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal = {arXiv preprint arXiv:2510.12784},
  year    = {2025}
}
```
📜 License
SRUM is licensed under the Apache License 2.0.