---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---
# Omni-R1
[Paper](https://arxiv.org/abs/2601.09536)
[Code](https://github.com/ModalityDance/Omni-R1)
[Omni-Bench](https://huggingface.co/datasets/ModalityDance/Omni-Bench)
## Overview
**Omni-R1** is trained with multimodal interleaved supervision. It first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL-based refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.
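As a rough illustration of the kind of data such training targets, the sketch below shows one way an interleaved reasoning trajectory could be represented in Python. The field names and structure are hypothetical and do not reflect the actual Omni-R1 or Zebra-CoT schema.

```python
# Hypothetical sketch of an interleaved multimodal reasoning trajectory.
# Keys ("steps", "type", "content") are illustrative only, not the real schema.
trajectory = {
    "question": "Fold the net into a cube. Which face ends up opposite the star?",
    "steps": [
        {"type": "text",  "content": "Fold the left flap upward and track the star face."},
        {"type": "image", "content": "<tokens of a generated intermediate sketch>"},
        {"type": "text",  "content": "After folding, the shaded face sits opposite the star."},
    ],
    "answer": "the shaded face",
}

# PeSFT supervises interleaved text/image targets of this shape; PeRPO then
# refines the resulting policy with RL on unified tasks.
for step in trajectory["steps"]:
    print(f'[{step["type"]}] {step["content"]}')
```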
## Usage
```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration
# 1) Import & load
model_id = "ModalityDance/Omni-R1" # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")
inputs = processor(
    text=prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)
# --- minimal image token preprocessing: replace <image> placeholder with image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]
placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values) # shape: [1, N] (or compatible)
mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)
# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)
# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
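For quick experiments before adopting those scripts, a minimal batch loop over a JSONL file can reuse the single-example pipeline above. This is only a sketch under assumed field names (`prompt`, `image_path`); `generate_one` stands in for steps 2) through 4) of the snippet above and is not part of the official codebase.

```python
import json
from PIL import Image

def run_jsonl(path, generate_one):
    """Run inference over a JSONL file where each line is assumed to look like
    {"prompt": "... <image>", "image_path": "example.png"} (illustrative schema).
    `generate_one(prompt, image)` should wrap steps 2) through 4) above and
    return the decoded text."""
    results = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            image = Image.open(record["image_path"]).convert("RGB")
            results.append({
                "prompt": record["prompt"],
                "output": generate_one(record["prompt"], image),
            })
    return results
```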
## License
This project is licensed under the **MIT License**.
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.
## Citation
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```