SRUM

SRUM Paper on Hugging Face · GitHub Repository · SRUM Website · SRUM Model · SRUM Data · SRUM Demo

SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin*, Yuwei Niu*, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu ✉️

contact: xihuiliu@hku.hk

Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model's strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model's own understanding module acts as an internal evaluator, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM's understanding module to guide and enhance its own generation via self-rewarding.

We present SRUM, a post-training reward fine-tuning method for Unified Multimodal Models (UMMs) that leverages a UMM's inherent understanding capabilities to boost its generative abilities, bridging performance gaps caused by conflicts between objectives in earlier training phases. SRUM demonstrates strong generalization across both compositional prompts and world-knowledge reasoning. The figure below showcases SRUM's qualitative performance compared with SFT and the base model.
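Conceptually, SRUM closes a feedback loop: the understanding module scores the generation module's output at two granularities, and the combined score becomes the training signal. The sketch below is a minimal illustration of that loop; the method names (generate_image, understand, fine_tune_step) and the equal weighting are hypothetical placeholders, not the released implementation.

# Minimal sketch of SRUM's self-rewarding loop (hypothetical API).
def srum_step(model, prompt, w_global=0.5, w_local=0.5):
    image = model.generate_image(prompt)  # generation module produces a candidate
    # Global reward: overall visual semantics and layout match the prompt.
    r_global = model.understand(image, f"Does this image faithfully depict: {prompt}?")
    # Local reward: fine-grained, object-level fidelity.
    r_local = model.understand(image, f"Rate each object required by: {prompt}")
    reward = w_global * r_global + w_local * r_local  # global-local dual reward
    model.fine_tune_step(prompt, image, reward)  # corrective signal for the generator
    return reward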

📢 News

We sincerely thank all contributors from the open community for their valuable support.

📮 Notice

Following Bagel's original settings, you should pay attention to the following:

About Inference Hyperparameters (an example configuration follows this list):

  • cfg_text_scale: Controls how strongly the model follows the text prompt. 1.0 disables text guidance. Typical range: 4.0โ€“8.0.
  • cfg_image_scale: Controls how much the model preserves input image details. 1.0 disables image guidance. Typical range: 1.0โ€“2.0.
  • cfg_interval: Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: [0.4, 1.0].
  • timestep_shift: Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).
  • num_timesteps: Total denoising steps. Typical: 50.
  • cfg_renorm_min: Minimum value for CFG-Renorm. 1.0 disables renorm. Typical: 0.
  • cfg_renorm_type: CFG-Renorm method:
    • global: Normalize over all tokens and channels (default for T2I).
    • channel: Normalize across channels for each token.
    • text_channel: Like channel, but only applies to text condition (good for editing, may cause blur).
  • If edited images appear blurry, try the global CFG-Renorm type, or decrease cfg_renorm_min or the CFG scales.
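Putting the typical values together, a text-to-image configuration might look like the sketch below. The dictionary keys mirror the parameter names above, but how they are passed to the inference script is an assumption; check Bagel's actual entry point before use.

# Illustrative T2I settings using the typical values described above.
t2i_hyperparams = {
    "cfg_text_scale": 6.0,        # text guidance strength (typical 4.0-8.0)
    "cfg_image_scale": 1.0,       # 1.0 disables image guidance (pure T2I)
    "cfg_interval": [0.4, 1.0],   # apply CFG only on this fraction of steps
    "timestep_shift": 3.0,        # illustrative value; higher favors early layout steps
    "num_timesteps": 50,          # total denoising steps
    "cfg_renorm_min": 0.0,        # 0 enables CFG-Renorm; 1.0 would disable it
    "cfg_renorm_type": "global",  # default for T2I
}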

🔥 Quick Start

1๏ธโƒฃ Set up environment

git clone https://github.com/WayneJin0918/SRUM
cd SRUM
conda env create -f environment.yaml
conda activate SRUM
pip install -r requirements.txt

If flash-attention is difficult to install via pip, download a prebuilt wheel and install it directly:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.0.post2/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
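The wheel filename encodes the build it targets (CUDA 12, PyTorch 2.5, CPython 3.10). Before installing, you can confirm your environment matches with a quick check:

# Verify the environment matches the flash-attn wheel you downloaded.
import sys
import torch

print(sys.version.split()[0])  # expect 3.10.x for a cp310 wheel
print(torch.__version__)       # expect 2.5.x for a torch2.5 wheel
print(torch.version.cuda)      # expect 12.x for a cu12 wheel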

Alternatively, you can follow Bagel's environment setup.

2๏ธโƒฃ Download Bagel pretrained or our SRUM checkpoint

# Download the Bagel base checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "ByteDance-Seed/BAGEL-7B-MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)

# Download the SRUM fine-tuned checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/SRUM_BAGEL_7B_MoT"
repo_id = "Wayne-King/SRUM_BAGEL_7B_MoT"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
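Note that on recent huggingface_hub releases, resume_download and local_dir_use_symlinks are deprecated (resuming is the default and local_dir always materializes real files), so the same download reduces to the minimal form below (shown for the SRUM checkpoint):

# Minimal equivalent on recent huggingface_hub versions.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Wayne-King/SRUM_BAGEL_7B_MoT",
    local_dir="models/SRUM_BAGEL_7B_MoT",
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)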

📊 Benchmarks

1. Composition (T2I-CompBench)

| T2I Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |

| Unified Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | 88.61 | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | 90.88 | 85.84 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78↑ | 90.22↑ | 86.57↑ | 85.10↑ | 74.52↑ | 85.44↑ | 93.88↑ | 86.52↓ | 85.75↑ |
| Bagel+SRUM | 83.10↑ | 92.90↑ | 88.69↑ | 88.47↑ | 78.52↑ | 84.23↑ | 86.92↑ | 89.57↑ | 86.55↑ |
| Bagel+SRUM (CoT) 🏆 | 88.60↑ | 92.90↑ | 91.31↑ | 90.48↑ | 80.12↑ | 84.47↑ | 89.93↑ | 89.15↑ | 88.37↑ |

2. Reasoning-Informed Generation (T2I-ReasonBench)

| Model | Entity | Idiom | Scientific | Textual Image | Average |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | **52.85** | **40.51** | **47.83** | **45.83** | **46.75** |

Performance comparison of Bagel models across four categories and their average scores. Bold values indicate the best performance in each column.

โœ๏ธ Citation

@article{jin2025srum,
  title   = {SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author  = {Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal = {arXiv preprint arXiv:2510.12784},
  year    = {2025}
}

📜 License

SRUM is licensed under the Apache License 2.0.
