---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---
# Omni-R1

## Overview
Omni-R1 is trained with multimodal interleaved supervision. It first applies PeSFT for stable functional image generation, then PeRPO for reinforcement-learning refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.
## Usage
```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (the prompt contains an <image> placeholder)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")
inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- Minimal image-token preprocessing: replace the <image> placeholder tokens
# --- with the discrete image tokens produced by the image tokenizer.
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]
placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # shape: [1, N] (or compatible)
mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```
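
If you only need the newly generated continuation rather than the full sequence (which echoes the prompt and the spliced-in image tokens), you can slice the output before decoding. This is a minimal sketch using standard tensor slicing and the processor's `batch_decode`; note that `skip_special_tokens=True` will also drop any generated image tokens from the decoded text:

```python
# Optional: decode only the newly generated portion (a minimal sketch; it assumes
# the prompt tokens occupy the first input_ids.shape[1] positions of each output).
generated = outputs[:, input_ids.shape[1]:]
continuation = processor.batch_decode(generated, skip_special_tokens=True)[0]
print(continuation)
```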
For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository:
https://github.com/ModalityDance/Omni-R1
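
As a rough illustration of batch JSONL inference (not the official script), the loop below reuses the single-example pipeline above. The JSONL schema with `prompt` and `image` fields is an assumption; adapt it to the format used in the repository.

```python
import json

# Hypothetical JSONL schema: one {"prompt": ..., "image": ...} object per line.
# Reuses processor, model, and placeholder_id from the single-example code above.
def run_batch(jsonl_path: str):
    results = []
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            prompt = record["prompt"]                             # assumed field name
            image = Image.open(record["image"]).convert("RGB")    # assumed field name
            inputs = processor(
                prompt,
                images=[image],
                padding=False,
                return_for_text_completion=True,
                return_tensors="pt",
            ).to(model.device)
            input_ids = inputs["input_ids"].long().clone()
            image_tokens = model.get_image_tokens(inputs["pixel_values"])
            mask = (input_ids == placeholder_id)
            input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)
            outputs = model.generate(
                input_ids=input_ids,
                max_length=4096,
                do_sample=True,
                temperature=0.5,
                top_p=0.9,
                pad_token_id=1,
                multimodal_generation_mode="unrestricted",
            )
            results.append(processor.batch_decode(outputs, skip_special_tokens=False)[0])
    return results
```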
## License
This project is licensed under the MIT License.
It also complies with the licenses of referenced third-party projects and dependencies, including the Chameleon Research License.
## Citation
```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
  title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
  author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
  year={2026},
  eprint={2601.09536},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.09536},
}
```