---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- multi-modal
- large-language-model
---
<p align="center">
    <img src="https://github.com/LengSicong/MMR1/blob/main/assets/logo.png?raw=true" width="150" style="margin-bottom: 0.2;"/>
</p>
<h3 align="center">MMR1: Advancing the Frontiers of Multimodal Reasoning</h3>
<h5 align="center">If you like our project, please give us a star ⭐ on <a href="https://github.com/LengSicong/MMR1">GitHub</a> to support us. 🙏🙏</h5>
## 📰 News
* **[2025.03.11]** 🔥🔥 Released MMR1-Math-v0, achieving SOTA among open-source 7B multimodal models with only 6k training samples!
## Links
Code: https://github.com/LengSicong/MMR1
## Model Description
MMR1-Math-v0-7B is a Large Multimodal Model specialized in mathematical tasks. It achieves state-of-the-art performance among open-source 7B multimodal models and competes effectively even against proprietary models with significantly larger parameter counts, despite being trained on only 6k carefully curated data instances.
### Key Highlights
- **SOTA Performance**: Sets a new **state-of-the-art** benchmark on math-related multimodal tasks among open-source 7B models.
- **Minimal Training Data**: Remarkably achieves top-tier performance with just **6k** high-quality samples from **public training datasets**.
- **Efficient Training with GRPO**: 6 hours of RL training with 64 H100s for 15 epochs (a minimal sketch of the group-relative advantage behind GRPO follows this list).
- **Public and High-Quality Data**: Publicly sourced datasets, rigorously filtered and balanced across both difficulty and mathematical problem types.
- **Balanced Data Strategy**: Uniform sampling of data based on both task difficulty (filtering out overly simple problems) and mathematical reasoning diversity.
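For context, GRPO (Group Relative Policy Optimization) samples a group of responses per prompt and scores each response relative to the rest of its group, removing the need for a learned value function. Below is a minimal, illustrative sketch of the group-relative advantage computation; it is a summary of the algorithm's core idea, not the project's actual training code (see the GitHub repo for that):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages as used by GRPO (illustrative sketch).

    `rewards` has shape (num_prompts, group_size): one row per prompt,
    one column per sampled response. Each response's advantage is its
    reward normalized by the mean and std of its own group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon guards all-equal groups

# Example: 2 prompts, 4 sampled responses each, rule-based 0/1 correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Responses that beat their group's average receive positive advantages and are reinforced; no critic network is trained.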
## Evaluation Results
We evaluated our model using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit/tree/main) on four mathematical reasoning benchmarks: MathVista_MINI, MathVision, LogicVista, and MathVerse_MINI.
We also include results on the MathVerse_MINI_Vision_Only_cot (MathVerse_V) subset to maintain consistency with the VLMEvalKit leaderboard. The table below compares our model's performance against various open-source and proprietary models.
| Model | Size | MathVista | MathVision | LogicVista | MathVerse | MathVerse_V |
|-------|:----:|:--------------:|:----------:|:----------:|:--------------:|:-------------------:|
| **Closed-source** | | | | | | |
| [GPT-4o 1120](https://openai.com/index/gpt-4o-system-card/) | - | 60.0 | 31.2 | 52.8 | 40.6 | - |
| [Gemini-2.0-flash](https://deepmind.google/technologies/gemini/flash/) | - | 70.4 | 43.6 | 52.3 | 47.8 | - |
| [Claude3.7-Sonnet](https://www.anthropic.com/news/claude-3-7-sonnet) | - | 66.8 | 41.9 | 58.2 | 46.7 | - |
| **R1-related** | | | | | | |
| [LLaVA-CoT](https://github.com/PKU-YuanGroup/LLaVA-CoT) | 11B | 52.5 | 19.9 | 39.6 | 22.6 | - |
| [Open-R1-Multimodal](https://github.com/EvolvingLMMs-Lab/open-r1-multimodal) | 7B | 60.6 | - | - | - | - |
| [Mulberry](https://github.com/HJYao00/Mulberry) | 7B | 63.1 | - | - | - | - |
| [LMM-R1](https://arxiv.org/abs/2503.07536) | 3B | 63.2 | 26.4 | - | - | 41.6 |
| [R1-Onevision](https://github.com/Fancy-MLLM/R1-Onevision?tab=readme-ov-file) | 7B | - | 26.2 | - | - | 44.1 |
| [MM-Eureka](https://github.com/ModalMinds/MM-EUREKA) | 8B | 67.1 | 22.2 | - | - | 40.4 |
| [MM-Eureka](https://github.com/ModalMinds/MM-EUREKA) | 38B | 64.2 | 26.6 | - | - | 48.9 |
| **Open-source** | | | | | | |
| [Ovis2-8b](https://github.com/AIDC-AI/Ovis) | 8B | 71.8 | 25.9 | 39.4 | 42.3 | - |
| [MiniCPM-o-2.6](https://github.com/OpenBMB/MiniCPM-o) | 8B | **71.9** | 21.7 | 36.0 | 35.0 | - |
| [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) (official) | 7B | 68.2 | 25.4 | 47.9 | 41.1 | - |
| [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) (reproduced) | 7B | 67.5 | 25.6 | 46.8 | 42.5 | 46.9 |
| **Ours** | | | | | | |
| **MMR1-Math-v0** | 7B | 71.0 | **30.2** | **50.8** | **45.1** | **49.8** |
## Quick Start
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "MMR1/MMR1-Math-v0-7B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn; drop this line to use the default attention
    device_map="auto",
)

# Default processor
processor = AutoProcessor.from_pretrained("MMR1/MMR1-Math-v0-7B")

# Example input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate, then strip the prompt tokens from each output sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
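The visual token budget (and hence memory use) can be bounded when loading the processor. Since MMR1-Math-v0 inherits Qwen2.5-VL's processor, it should accept the same `min_pixels`/`max_pixels` arguments as the base model; the values below are illustrative, not tuned for this checkpoint:

```python
# Optional (assumption: inherited from the Qwen2.5-VL processor): bound the
# pixel count, and hence the number of visual tokens, each image may consume.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "MMR1/MMR1-Math-v0-7B", min_pixels=min_pixels, max_pixels=max_pixels
)
```

Images are then resized so their pixel counts fall within this range, trading visual detail against speed and memory.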
<details>
<summary>Batch inference</summary>

```python
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
</details>
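<details>
<summary>Video inference (sketch)</summary>

Although MMR1-Math-v0 is tuned for image-based math problems, the inference stack above is Qwen2.5-VL's, and `process_vision_info` also understands video entries (the processor calls already forward `videos=video_inputs`). A hedged sketch with a placeholder path, untested for this checkpoint:

```python
# Assumption: video support is inherited from the Qwen2.5-VL stack; this
# checkpoint is advertised for math images, so treat video as experimental.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
```

</details>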
## Citation
If you find MMR1 useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{MMR1-Math2025,
title={MMR1: Advancing the Frontiers of Multimodal Reasoning},
author={Sicong Leng*, Jing Wang*, Jiaxi Li*, Hao Zhang*, Zhiqiang Hu, Boqiang Zhang, Hang Zhang, Yuming Jiang, Xin Li, Fan Wang, Yu Rong, Aixin Sun†, Shijian Lu†},
year={2025},
howpublished={\url{https://github.com/LengSicong/MMR1}},
}
```