---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---
# Omni-R1

## Overview
Omni-R1 is trained with multimodal interleaved supervision. It first applies PeSFT for stable functional image generation, then PeRPO for reinforcement-learning refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.
## Usage
```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (the prompt contains an <image> placeholder)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")
inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- Minimal image-token preprocessing: replace the <image> placeholder tokens
# --- with the discrete image tokens produced by the image tokenizer.
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]
placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # shape: [1, N] (or compatible)
mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
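
If you only need the newly generated continuation rather than the full sequence (which echoes the prompt and the spliced-in image tokens), you can slice the output before decoding. This is a minimal sketch using standard tensor slicing and the processor's `batch_decode`; note that `skip_special_tokens=True` will also drop any generated image tokens from the decoded text:

```python
# Optional: decode only the newly generated portion (a minimal sketch; it assumes
# the prompt tokens occupy the first input_ids.shape[1] positions of each output).
generated = outputs[:, input_ids.shape[1]:]
continuation = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(continuation)
```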
For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
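
As a rough illustration of batch JSONL inference (not the official script), the loop below reuses the single-example pipeline above. The JSONL schema with `prompt` and `image` fields is an assumption; adapt it to the format used in the repository.

```python
import json

# Hypothetical JSONL schema: one {"prompt": ..., "image": ...} object per line.
# Reuses processor, model, and placeholder_id from the single-example code above.
def run_batch(jsonl_path: str):
    results = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            prompt = record["prompt"]                             # assumed field name
            image = Image.open(record["image"]).convert("RGB")    # assumed field name
            inputs = processor(
                prompt,
                images=[image],
                padding=False,
                return_for_text_completion=True,
                return_tensors="pt",
            ).to(model.device)
            input_ids = inputs["input_ids"].long().clone()
            image_tokens = model.get_image_tokens(inputs["pixel_values"])
            mask = (input_ids == placeholder_id)
            input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)
            outputs = model.generate(
                input_ids=input_ids,
                max_length=4096,
                do_sample=True,
                temperature=0.5,
                top_p=0.9,
                pad_token_id=1,
                multimodal_generation_mode="unrestricted",
            )
            results.append(processor.batch_decode(outputs, skip_special_tokens=False)[0])
    return results
```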
## License
This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.
## Citation
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
  title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
  author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
  year={2026},
  eprint={2601.09536},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.09536},
}
```