Improve model card: Add pipeline tag, library name, language, and explicit links
This PR enhances the model card by:
- Adding `pipeline_tag: image-text-to-text` to improve discoverability on the Hugging Face Hub.
- Specifying `library_name: transformers` to enable the automated "how to use" widget, as the model demonstrates compatibility with the 🤗 Transformers library.
- Including `language: en` as an additional tag, reflecting the primary language of the model and its documentation.
- Adding explicit "Paper" and "Code" sections with links to the arXiv paper and the GitHub repository, respectively, for clearer navigation.
The existing "QuickStart" section containing sample usage code snippets is preserved.
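Taken together, the resulting metadata block at the top of `README.md` is:

```yaml
---
base_model:
- OpenGVLab/InternVL3-2B
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
language: en
---
```

The `pipeline_tag` and `library_name` fields drive the Hub's task filter and the automated code-snippet widget, respectively.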
README.md CHANGED

````diff
@@ -1,7 +1,10 @@
 ---
-license: apache-2.0
 base_model:
 - OpenGVLab/InternVL3-2B
+license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
+language: en
 ---
 
 **EN** | [中文](README_CN.md)
@@ -21,6 +24,12 @@ base_model:
 <img alt="Leaderboard" src="https://img.shields.io/badge/%F0%9F%A4%97%20_EASI-Leaderboard-ffc107?color=ffc107&logoColor=white" height="20" />
 </a>
 
+## Paper
+The model was presented in the paper [Scaling Spatial Intelligence with Multimodal Foundation Models](https://arxiv.org/abs/2511.13719).
+
+## Code
+The official code can be found in the [OpenSenseNova/SenseNova-SI GitHub repository](https://github.com/OpenSenseNova/SenseNova-SI).
+
 ## Overview
 Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence.
 In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the **SenseNova-SI family**,
@@ -131,4 +140,131 @@ which achieve state-of-the-art performance among open-source models of comparabl
 <td>GPT-5-2025-08-07</td><td>55.0</td><td>41.8</td><td>56.3</td><td>45.5</td><td>61.8</td>
 </tr>
 </tbody>
-</table>
+</table>
+
+## 🛠️ QuickStart
+
+### Installation
+
+We recommend using [uv](https://docs.astral.sh/uv/) to manage the environment.
+
+> uv installation guide: <https://docs.astral.sh/uv/getting-started/installation/#installing-uv>
+
+```bash
+git clone git@github.com:OpenSenseNova/SenseNova-SI.git
+cd SenseNova-SI/
+uv sync --extra cu124  # or one of [cu118|cu121|cu124|cu126|cu128|cu129], depending on your CUDA version
+uv sync
+source .venv/bin/activate
+```
+
+#### Hello World
+
+A simple image-free test to verify the environment setup and download the model.
+
+```bash
+python example.py \
+    --question "Hello" \
+    --model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+```
+
+### Examples
+
+#### Example 1
+
+This example is from the `Pos-Obj-Obj` subset of [MMSI-Bench](https://github.com/InternRobotics/MMSI-Bench):
+
+```bash
+python example.py \
+    --image_paths examples/Q1_1.png examples/Q1_2.png \
+    --question "<image><image>
+You are standing in front of the dice pattern and observing it. Where is the desk lamp approximately located relative to you?
+Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, C: 135 degrees counterclockwise, D: 135 degrees clockwise" \
+    --model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+    # --model_path OpenGVLab/InternVL3-8B
+```
+
+<!-- Example 1 -->
+<details open>
+<summary><strong>Details of Example 1</strong></summary>
+<p><strong>Q:</strong> <image><image>
+You are standing in front of the dice pattern and observing it. Where is the desk lamp approximately located relative to you?
+Options: A: 90 degrees counterclockwise, B: 90 degrees clockwise, C: 135 degrees counterclockwise, D: 135 degrees clockwise</p>
+<table>
+<tr>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q1_1.png" alt="First image" width="100%">
+</td>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q1_2.png" alt="Second image" width="100%">
+</td>
+</tr>
+</table>
+<p><strong>GT: C</strong></p>
+</details>
+
+
+#### Example 2
+
+This example is from the `Rotation` subset of [MindCube](https://mind-cube.github.io/):
+
+```bash
+python example.py \
+    --image_paths examples/Q2_1.png examples/Q2_2.png \
+    --question "<image><image>
+Based on these two views showing the same scene: in which direction did I move from the first view to the second view?
+A. Directly left B. Directly right C. Diagonally forward and right D. Diagonally forward and left" \
+    --model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+    # --model_path OpenGVLab/InternVL3-8B
+```
+
+<!-- Example 2 -->
+<details open>
+<summary><strong>Details of Example 2</strong></summary>
+<p><strong>Q:</strong> Based on these two views showing the same scene: in which direction did I move from the first view to the second view?
+A. Directly left B. Directly right C. Diagonally forward and right D. Diagonally forward and left</p>
+<table>
+<tr>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q2_1.png" alt="First image" width="100%">
+</td>
+<td align="center" width="50%" style="padding:4px;">
+<img src="./examples/Q2_2.png" alt="Second image" width="100%">
+</td>
+</tr>
+</table>
+<p><strong>GT: D</strong></p>
+</details>
+
+
+#### Test Multiple Questions in a Single Run
+
+Prepare a file similar to [examples/examples.jsonl](examples/examples.jsonl), where each line represents a single question.
+
+The model is loaded once and processes the questions sequentially; the questions remain independent of each other.
+
+> For more details on the `jsonl` format, refer to the documentation for [Single-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#single-image-data) and [Multi-Image Data](https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#multi-image-data).
+
+
+```bash
+python example.py \
+    --jsonl_path examples/examples.jsonl \
+    --model_path sensenova/SenseNova-SI-1.1-InternVL3-8B
+    # --model_path OpenGVLab/InternVL3-8B
+```
+
+### Evaluation
+
+To reproduce the benchmark results above, please refer to [EASI](https://github.com/EvolvingLMMs-Lab/EASI) to evaluate SenseNova-SI on mainstream spatial intelligence benchmarks.
+
+
+## 🖊️ Citation
+
+```bib
+@article{sensenova-si,
+  title   = {Scaling Spatial Intelligence with Multimodal Foundation Models},
+  author  = {Cai, Zhongang and Wang, Ruisi and Gu, Chenyang and Pu, Fanyi and Xu, Junxiang and Wang, Yubo and Yin, Wanqi and Yang, Zhitao and Wei, Chen and Sun, Qingping and Zhou, Tongxi and Li, Jiaqi and Pang, Hui En and Qian, Oscar and Wei, Yukun and Lin, Zhiqian and Shi, Xuanke and Deng, Kewang and Han, Xiaoyang and Chen, Zukai and Fan, Xiangyu and Deng, Hanming and Lu, Lewei and Pan, Liang and Li, Bo and Liu, Ziwei and Wang, Quan and Lin, Dahua and Yang, Lei},
+  journal = {arXiv preprint arXiv:2511.13719},
+  year    = {2025}
+}
+```
````