---
library_name: transformers
tags:
- multimodal
- reasoning
- sft
- rl
datasets:
- multimodal-reasoning-lab/Zebra-CoT
- ModalityDance/Omni-Bench
base_model:
- GAIR/Anole-7b-v0.1
pipeline_tag: any-to-any
---

# Omni-R1

[![Paper](https://img.shields.io/badge/Paper-arXiv-b31b1b?style=for-the-badge&logo=arxiv)](https://arxiv.org/abs/2601.09536)
[![Code](https://img.shields.io/badge/GitHub-Code-blue?style=for-the-badge&logo=github)](https://github.com/ModalityDance/Omni-R1)
[![Omni-Bench](https://img.shields.io/badge/Dataset-Omni--Bench-fcc21b?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/datasets/ModalityDance/Omni-Bench)

## Overview

**Omni-R1** is trained with multimodal interleaved supervision: it first applies **PeSFT** for stable functional image generation, then **PeRPO** for RL refinement on unified tasks, enabling interleaved multimodal reasoning trajectories.

## Usage

```python
import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

# 1) Import & load
model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# 2) Prepare a single input (prompt contains <image>)
prompt = "What is the smiling man in the image wearing? <image>"
image = Image.open("image.png").convert("RGB")

inputs = processor(
    prompt,
    images=[image],
    padding=False,
    return_for_text_completion=True,
    return_tensors="pt",
).to(model.device)

# --- minimal image token preprocessing: replace the <image> placeholder with image tokens ---
input_ids = inputs["input_ids"].long()
pixel_values = inputs["pixel_values"]

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]
image_tokens = model.get_image_tokens(pixel_values)  # shape: [1, N] (or compatible)

mask = (input_ids == placeholder_id)
input_ids = input_ids.clone()
input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

# 3) Call the model
outputs = model.generate(
    input_ids=input_ids,
    max_length=4096,
    do_sample=True,
    temperature=0.5,
    top_p=0.9,
    pad_token_id=1,
    multimodal_generation_mode="unrestricted",
)

# 4) Get results
text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
print(text)
```

For full scripts (batch JSONL inference, interleaved decoding, and vLLM-based evaluation), please refer to the official GitHub repository: https://github.com/ModalityDance/Omni-R1. A minimal batch-inference sketch based on the snippet above is appended at the end of this card.

## License

This project is licensed under the **MIT License**. It also complies with the licenses of referenced third-party projects and dependencies, including the **Chameleon Research License**.

## Citation

```bibtex
@misc{cheng2026omnir1unifiedgenerativeparadigm,
      title={Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning},
      author={Dongjie Cheng and Yongqi Li and Zhixin Ma and Hongru Cai and Yupeng Hu and Wenjie Wang and Liqiang Nie and Wenjie Li},
      year={2026},
      eprint={2601.09536},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2601.09536},
}
```
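
The script below is a minimal batch-JSONL inference sketch assembled from the single-example snippet in the Usage section. It is not one of the official scripts: the file names (`inputs.jsonl`, `outputs.jsonl`) and record fields (`prompt`, `image`) are illustrative assumptions, and each prompt is assumed to already contain the `<image>` placeholder. For the maintained versions (batch JSONL inference, interleaved decoding, vLLM-based evaluation), see the GitHub repository above.

```python
# Batch JSONL inference sketch (illustrative, not an official script).
# Assumptions: each input line is a JSON object like
#   {"prompt": "... <image>", "image": "path/to/image.png"}
# and the file/field names below are placeholders.
import json

import torch
from PIL import Image
from transformers import ChameleonProcessor, ChameleonForConditionalGeneration

model_id = "ModalityDance/Omni-R1"  # or "ModalityDance/Omni-R1-Zero"
processor = ChameleonProcessor.from_pretrained(model_id)
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

placeholder_id = processor.tokenizer.encode("<image>", add_special_tokens=False)[0]

with open("inputs.jsonl") as fin, open("outputs.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        image = Image.open(record["image"]).convert("RGB")
        inputs = processor(
            record["prompt"],  # assumed to already contain the <image> placeholder
            images=[image],
            padding=False,
            return_for_text_completion=True,
            return_tensors="pt",
        ).to(model.device)

        # Replace the <image> placeholder with image tokens, as in the Usage snippet.
        input_ids = inputs["input_ids"].long().clone()
        image_tokens = model.get_image_tokens(inputs["pixel_values"])
        mask = (input_ids == placeholder_id)
        input_ids[mask] = image_tokens.reshape(-1).to(dtype=torch.long, device=input_ids.device)

        outputs = model.generate(
            input_ids=input_ids,
            max_length=4096,
            do_sample=True,
            temperature=0.5,
            top_p=0.9,
            pad_token_id=1,
            multimodal_generation_mode="unrestricted",
        )
        text = processor.batch_decode(outputs, skip_special_tokens=False)[0]
        fout.write(json.dumps({"prompt": record["prompt"], "output": text}, ensure_ascii=False) + "\n")
```

Examples are processed one at a time to stay close to the Usage snippet; true batched decoding would need prompt padding and is left to the official scripts.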