---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---

# Omni-R1

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1** is trained with multimodal interleaved supervision. It first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL-based refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)  # cast floating inputs so pixel_values match the model's VQ image encoder dtype

# --- minimal image token preprocessing: replace <image> placeholder with image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # shape: [1, N] (or compatible)

mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
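
The decoded string above contains the prompt followed by the generated continuation (text and, possibly, image-token spans). If you only need the newly generated text, a small generic post-processing step such as the following may be convenient (a sketch, not part of the official scripts):

```python
# Keep only the tokens produced after the prompt (batch size 1 assumed).
generated_ids = outputs[:, input_ids.shape[1]:]

# Decode to plain text; dropping special tokens also drops raw image-token spans,
# which require the decoding utilities from the official repository to be turned back into pixels.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```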

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:  
https://github.com/ModalityDance/Omni-R1
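
If you just want to loop over a local dataset before adopting those scripts, a minimal per-sample sketch can reuse the snippet above (the file names and the `prompt`/`image` JSONL fields below are illustrative assumptions, not the official schema):

```python
import json

# Hypothetical JSONL input: one {"prompt": "... <image>", "image": "path.png"} object per line.
with open("samples.jsonl") as f_in, open("outputs.jsonl", "w") as f_out:
    for line in f_in:
        sample = json.loads(line)
        image = Image.open(sample["image"]).convert("RGB")

        inputs = processor(
            sample["prompt"],
            images=[image],
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device, dtype=torch.bfloat16)

        # Same placeholder replacement as in the single-example snippet above
        # (placeholder_id, processor, and model are defined there).
        input_ids = inputs["input_ids"].long().clone()
        image_tokens = model.get_image_tokens(inputs["pixel_values"])
        mask = input_ids == placeholder_id
        input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

        outputs = model.generate(
            input_ids=input_ids,
            max_length=4096,
            do_sample=True,
            temperature=0.5,
            top_p=0.9,
            pad_token_id=1,
            multimodal_generation_mode="unrestricted",
        )
        text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        f_out.write(json.dumps({"prompt": sample["prompt"], "output": text}) + "\n")
```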

## License

This project is licensed under the **MIT License**.  
It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning}, 
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536}, 
}
```