license: cc-by-sa-4.0
title: 'SpatialVID: A Large Scale Video Dataset with Spatial Annotations'
sdk: static
emoji: 📚
colorFrom: red
<h1 align='center'>SpatialVID: A Large-Scale Video Dataset with Spatial Annotations</h1>
<div align='center'>
<a href='https://oiiiwjh.github.io/' target='_blank'>Jiahao Wang</a><sup>1*</sup>
<a href='https://github.com/FelixYuan-YF' target='_blank'>Yufeng Yuan</a><sup>1*</sup>
<a href='https://github.com/zrj-cn' target='_blank'>Rujie Zheng</a><sup>1*</sup>
<a href='https://linyou.github.io' target='_blank'>Youtian Lin</a><sup>1</sup>
<a href='https://ygaojiany.github.io' target='_blank'>Jian Gao</a><sup>1</sup>
<a href='https://linzhuo.xyz' target='_blank'>Lin-Zhuo Chen</a><sup>1</sup>
</div>
<div align='center'>
<a href='https://openreview.net/profile?id=~yajie_bao5' target='_blank'>Yajie Bao</a><sup>1</sup>
<a href='https://github.com/YeeZ93' target='_blank'>Yi Zhang</a><sup>1</sup>
<a href='#' target='_blank'>Chang Zeng</a><sup>1</sup>
<a href='https://github.com/yxzhou217' target='_blank'>Yanxi Zhou</a><sup>1</sup>
<a href='https://www.xxlong.site/index.html' target='_blank'>Xiaoxiao Long</a><sup>1</sup>
<a href='http://zhuhao.cc/home/' target='_blank'>Hao Zhu</a><sup>1</sup>
</div>
<div align='center'>
<a href='http://zhaoxiangzhang.net/' target='_blank'>Zhaoxiang Zhang</a><sup>2</sup>
<a href='https://cite.nju.edu.cn/People/Faculty/20190621/i5054.html' target='_blank'>Xun Cao</a><sup>1</sup>
<a href='https://yoyo000.github.io/' target='_blank'>Yao Yao</a><sup>1†</sup>
</div>
<div align='center'>
<sup>1</sup>Nanjing University <sup>2</sup>Institute of Automation, Chinese Academy of Sciences
</div>
<br>
<div align="center">
<a href="https://nju-3dv.github.io/projects/SpatialVID/"><img src="https://img.shields.io/static/v1?label=SpatialVID&message=Project&color=purple"></a>
<a href="#"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv&color=red&logo=arxiv"></a>
<a href="https://github.com/NJU-3DV/spatialVID"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a>
<a href="https://huggingface.co/SpatialVID"><img src="https://img.shields.io/static/v1?label=Dataset&message=HuggingFace&color=yellow&logo=huggingface"></a>
</div>
<p align="center">
<img src="assets/overview.png" height=400>
</p>
## Abstract
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect **SpatialVID**, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than **21,000 hours** of raw video and process it into **2.7 million clips** through a hierarchical filtering pipeline, totaling **7,089 hours** of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
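To put the scale figures above in perspective, the short sketch below derives two quantities from them: the implied average clip length and the share of raw footage that survives filtering. The constant and function names are illustrative and not part of the SpatialVID codebase. The inputs are the rounded figures from the abstract, so the results are approximate.

```python
# Figures quoted in the abstract (rounded; raw hours is a lower bound).
RAW_HOURS = 21_000       # "more than 21,000 hours" of raw video
CLIP_HOURS = 7_089       # hours of dynamic content kept after filtering
NUM_CLIPS = 2_700_000    # "2.7 million clips"


def average_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Mean clip duration in seconds implied by total hours and clip count."""
    return total_hours * 3600 / num_clips


def retention_ratio(kept_hours: float, raw_hours: float) -> float:
    """Fraction of raw footage that remains after filtering."""
    return kept_hours / raw_hours


if __name__ == "__main__":
    avg = average_clip_seconds(CLIP_HOURS, NUM_CLIPS)
    kept = retention_ratio(CLIP_HOURS, RAW_HOURS)
    print(f"average clip length: about {avg:.2f} s")
    # RAW_HOURS is a lower bound, so this ratio is an upper bound.
    print(f"retained footage: at most about {kept:.1%}")
```

Under these assumptions, the clips average roughly 9.5 seconds each, and about one third of the raw footage is retained.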
## Demonstration
<table class="center" style="width: 100%; border-collapse: separate; border-spacing: 8px;">
<!-- Row 1 -->
<tr>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample1.gif" alt="Sample 1 Original" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample1_depth.gif" alt="Sample 1 Depth" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample1_pose.gif" alt="Sample 1 Pose" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
</tr>
<tr>
<td colspan="3" style="border: none; padding: 8px 12px; vertical-align: top; word-wrap: break-word; line-height: 1.5;">
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Abstract:</span> A sunlit Italian courtyard surrounded by stone buildings, with climbing plants, flower pots, and a staircase leading upward, evoking a peaceful, timeless atmosphere.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Description:</span> The scene depicts a quaint, sunlit courtyard nestled within a historic Italian village. Stone buildings with shuttered windows enclose the space, their walls adorned with climbing plants and colorful flower pots. A staircase, also decorated with potted plants, leads to an upper level. The courtyard is paved with light-colored stones, reflecting the bright sunlight. The overall atmosphere is peaceful and charming, evoking a sense of timeless beauty and tranquility.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Immersive Shot Summary:</span> The camera moves smoothly forward through a shadowed alley, its path tilting slightly downward as it weaves between weathered stone walls. As it emerges into a sun-drenched courtyard, the scene unfolds: stone steps lined with blooming flowers, soft light dancing on ancient stonework, and a quiet, timeless charm enveloping the space.</p>
<p style="margin: 0;"><span style="font-weight: 600; color: #2d3748;">Camera Description:</span> The camera glides steadily forward, moving through a narrow passage as it translates rightward and slightly downward. The motion remains smooth and consistent, gradually revealing an open courtyard with stone steps and potted plants. The forward movement slows as the camera reaches the space, offering a wide view of the historic setting.</p>
</td>
</tr>
<!-- Row 2 -->
<tr>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample2.gif" alt="Sample 2 Original" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample2_depth.gif" alt="Sample 2 Depth" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample2_pose.gif" alt="Sample 2 Pose" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
</tr>
<tr>
<td colspan="3" style="border: none; padding: 8px 12px; vertical-align: top; word-wrap: break-word; line-height: 1.5;">
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Abstract:</span> A modern, sunlit living area features high ceilings, wooden beams, and sleek furniture, evoking a sense of comfort and sophistication.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Description:</span> The scene depicts a spacious, modern living area with high ceilings and exposed wooden beams. Large windows offer a view of an outdoor patio and landscape. Two white sofas face each other, flanking a light-colored coffee table. Dark leather armchairs sit near a dining area with a long table and chairs. A kitchen island with bar stools is visible in the background. The room is brightly lit, creating a warm and inviting atmosphere. The overall tone is luxurious and comfortable.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Immersive Shot Summary:</span> The camera drifts forward through the airy, sun-drenched room, gliding past sleek sofas and a polished coffee table. As it moves left, the expansive space unfolds, highlighting the elegant design and warm ambiance of the luxurious living area.</p>
<p style="margin: 0;"><span style="font-weight: 600; color: #2d3748;">Camera Description:</span> The camera glides smoothly forward, gradually revealing the vast interior space. It then shifts left, sweeping across the open living area. The motion slows as it stabilizes, capturing the room’s luxurious details before subtly moving closer to the scene.</p>
</td>
</tr>
<!-- Row 3 -->
<tr>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample3.gif" alt="Sample 3 Original" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample3_depth.gif" alt="Sample 3 Depth" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample3_pose.gif" alt="Sample 3 Pose" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
</tr>
<tr>
<td colspan="3" style="border: none; padding: 8px 12px; vertical-align: top; word-wrap: break-word; line-height: 1.5;">
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Abstract:</span> A towering mountain range stretches beneath a brooding sky, its jagged peaks shrouded in swirling clouds, evoking a sense of quiet majesty and natural grandeur.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Description:</span> The scene showcases a dramatic mountain landscape with jagged peaks and steep cliffs. A thick layer of clouds fills the valleys below, creating a sense of depth and mystery. The rocky terrain is sparsely covered with patches of green vegetation. The lighting is soft and diffused, suggesting an overcast day. The overall tone is awe-inspiring and serene, emphasizing the grandeur and scale of nature. The scene evokes a feeling of tranquility and invites contemplation.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Immersive Shot Summary:</span> The camera surges forward through the air, descending along the rocky ridge as mist curls below. A gentle shift to the right reveals the sheer drop of the valley, the soft light casting long shadows across the barren slopes, capturing the raw beauty of the untamed landscape.</p>
<p style="margin: 0;"><span style="font-weight: 600; color: #2d3748;">Camera Description:</span> The camera glides forward and downward, tracing a dynamic path along the mountain ridge. It shifts slightly right as it descends, revealing the steep drop below. The motion is smooth and continuous, with a steady acceleration that emphasizes the vastness of the landscape.</p>
</td>
</tr>
<!-- Row 4 -->
<tr>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample4.gif" alt="Sample 4 Original" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample4_depth.gif" alt="Sample 4 Depth" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
<td style="width: 33.33%; border: none; padding-top: 16px; text-align: center; vertical-align: top;">
<img src="assets/sample4_pose.gif" alt="Sample 4 Pose" style="max-width: 100%; height: auto; border-radius: 4px;">
</td>
</tr>
<tr>
<td colspan="3" style="border: none; padding: 8px 12px; vertical-align: top; word-wrap: break-word; line-height: 1.5;">
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Abstract:</span> A rainy night in Seoul features glistening streets, neon reflections, and bustling yet calm pedestrian activity amid a blend of traditional and modern architecture.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Scene Description:</span> A rainy night in Seoul, South Korea, is depicted with glistening streets reflecting neon lights. Pedestrians with umbrellas walk along the sidewalk. The storefronts are brightly lit, displaying clothing and currency exchange services. The atmosphere is calm and subdued, with the rain creating a reflective surface on the pavement. The scene conveys a sense of urban life continuing despite the weather, with a mix of traditional and modern elements in the architecture and signage.</p>
<p style="margin: 0 0 8px 0;"><span style="font-weight: 600; color: #2d3748;">Immersive Shot Summary:</span> The camera smoothly drifts right along a rain-slicked street, its path illuminated by glowing neon signs. Pedestrians with umbrellas move past brightly lit shops, their reflections shimmering on the wet pavement as the scene pulses with quiet urban energy.</p>
<p style="margin: 0;"><span style="font-weight: 600; color: #2d3748;">Camera Description:</span> The camera glides steadily to the right, moving through a wet urban landscape at night. It maintains a consistent pace, revealing storefronts and pedestrians under umbrellas, before gradually coming to a halt.</p>
</td>
</tr>
</table>
## Dataset Statistics
<p align="center" width="100%">
<a target="_blank"><img src="assets/statistics1.jpg" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
</p>
<p align="center" width="100%">
<a target="_blank"><img src="assets/statistics2.jpg" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
</p>
<p align="center" width="100%">
<a target="_blank"><img src="assets/statistics3.jpg" style="width: 100%; min-width: 200px; display: block; margin: auto;"></a>
</p>
## Curation Pipeline
For more details about the dataset curation pipeline, please refer to our [GitHub Code](https://github.com/NJU-3DV/spatialVID).
## License of SpatialVID
SpatialVID is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA-4.0). Users must attribute the original source, use the resource only for non-commercial purposes, and release any modified/derived works under the same license. For the full license text, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.
## Citation
If you find this project useful for your research, please cite our paper.
```bibtex
@article{wang2025spatialvid,
  title={SpatialVID: A Large-Scale Video Dataset with Spatial Annotations},
  author={Jiahao Wang and Yufeng Yuan and Rujie Zheng and Youtian Lin and Jian Gao and Lin-Zhuo Chen and Yajie Bao and Chang Zeng and Yanxi Zhou and Yi Zhang and Xiaoxiao Long and Hao Zhu and Zhaoxiang Zhang and Xun Cao and Yao Yao},
  journal={arXiv},
  year={2025}
}
```