---
base_model:
- OpenGVLab/InternVL3-2B
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
language: en
---
**EN** | [中文](README_CN.md)
# SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models
## Paper
The model was presented in the paper [Scaling Spatial Intelligence with Multimodal Foundation Models](https://arxiv.org/abs/2511.13719).
## Code
The official code can be found on the [OpenSenseNova/SenseNova-SI GitHub repository](https://github.com/OpenSenseNova/SenseNova-SI).
## Overview
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence.
In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the **SenseNova-SI family**,
built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel).
We take a principled approach to building high-performing and robust spatial intelligence by systematically curating **SenseNova-SI-8M**,
a dataset of eight million diverse samples organized under a rigorous taxonomy of spatial capabilities.
SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube,
54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En).
More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization enabled by diverse training data,
examine the risks of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate potential downstream applications. SenseNova-SI is an ongoing project, and this report will be updated continuously.
All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.*
## Release Information
Currently, we build SenseNova-SI upon popular open-source foundation models to maximize compatibility with existing research pipelines.
In this release, we present
[**SenseNova-SI-1.1-InternVL3-2B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-2B) and
[**SenseNova-SI-1.1-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-InternVL3-8B),
which achieve state-of-the-art performance among open-source models of comparable size across five recent spatial intelligence benchmarks:
**VSI**, **MMSI**, **MindCube**, **ViewSpatial**, and **SITE**.
| Model | VSI | MMSI | MindCube-Tiny | ViewSpatial | SITE |
|---|---|---|---|---|---|
| **Open-source Models (~2B)** | | | | | |
| InternVL3-2B | 32.9 | 26.5 | 37.5 | 32.5 | 30.0 |
| Qwen3-VL-2B-Instruct | 50.3 | 28.9 | 34.5 | 36.9 | 35.6 |
| MindCube-3B-RawQA-SFT | 17.2 | 1.7 | 51.7 | 24.1 | 6.3 |
| SpatialLadder-3B | 44.8 | 27.4 | 43.4 | 39.8 | 27.9 |
| SpatialMLLM-4B | 46.3 | 26.1 | 33.4 | 34.6 | 18.0 |
| VST-3B-SFT | 57.9 | 30.2 | 35.9 | 52.8 | 35.8 |
| Cambrian-S-3B | 57.3 | 25.2 | 32.5 | 39.0 | 28.3 |
| SenseNova-SI-1.1-InternVL3-2B | 63.7 | 34.2 | 41.8 | 52.6 | 36.7 |
| **Open-source Models (~8B)** | | | | | |
| InternVL3-8B | 42.1 | 28.0 | 41.5 | 38.6 | 41.1 |
| Qwen3-VL-8B-Instruct | 57.9 | 31.1 | 29.4 | 42.2 | 45.8 |
| BAGEL-7B-MoT | 31.4 | 31.0 | 34.7 | 41.3 | 37.0 |
| SpaceR-7B | 41.5 | 27.4 | 37.9 | 35.8 | 34.2 |
| ViLaSR-7B | 44.6 | 30.2 | 35.1 | 35.7 | 38.7 |
| VST-7B-SFT | 60.6 | 32.0 | 39.7 | 50.5 | 39.6 |
| Cambrian-S-7B | 67.5 | 25.8 | 39.6 | 40.9 | 33.0 |
| SenseNova-SI-1.1-InternVL3-8B | 68.7 | 43.3 | 85.6 | 54.6 | 47.7 |
| **Proprietary Models** | | | | | |
| Gemini-2.5-pro-2025-06 | 53.5 | 38.0 | 57.6 | 46.0 | 57.0 |
| Grok-4-2025-07-09 | 47.9 | 37.8 | 63.5 | 43.2 | 47.0 |
| GPT-5-2025-08-07 | 55.0 | 41.8 | 56.3 | 45.5 | 61.8 |
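The SenseNova-SI-1.1 checkpoints above are built on InternVL3, which splits high-resolution inputs into 448×448 tiles before the vision encoder, choosing a tile grid that best matches the image's aspect ratio. A minimal sketch of that grid-selection step is shown below; the function name, signature, and tie-breaking rule are illustrative assumptions, not the official preprocessing code.

```python
# Sketch of InternVL-style dynamic tiling: choose a (cols, rows) tile grid
# whose aspect ratio is closest to the input image's, with the total tile
# count capped at max_num. This is an illustrative simplification, not the
# official SenseNova-SI / InternVL3 preprocessing implementation.

def closest_tile_grid(width, height, min_num=1, max_num=12):
    """Return (cols, rows) whose cols/rows ratio best matches width/height."""
    aspect = width / height
    # Enumerate every grid whose tile count lies in [min_num, max_num].
    candidates = {
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
    }
    # Pick the grid nearest the image aspect ratio; break ties in favor of
    # grids that use more tiles (keeping more detail).
    return min(candidates, key=lambda g: (abs(aspect - g[0] / g[1]), -g[0] * g[1]))
```

For example, a 2:1 panorama (896×448) maps to a wide (4, 2) grid, i.e. eight 448×448 tiles, while a square image maps to a square grid.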
## Demo Example
*(Example images are omitted in this text version.)*

**Q:** Based on these two views showing the same scene: in which direction did I move from the first view to the second view? A. Directly left B. Directly right C. Diagonally forward and right D. Diagonally forward and left

**GT:** D