SpatialVID

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

¹Nanjing University  ²Institute of Automation, Chinese Academy of Sciences

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for dynamic scenes with realistic camera motion. To address this gap, we collect a large corpus of raw video with natural camera movement, providing the foundation for constructing a dataset with unique scale and diversity. In this work, we introduce SpatialVID, a large-scale dynamic spatial dataset explicitly designed to provide expressive annotations for this purpose. Through a hierarchical filtering pipeline, we process more than 21,000 hours of collected raw video into 2.7 million clips, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and labels for camera motion and scene composition.
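As a rough sanity check on the figures quoted in the abstract, the short sketch below derives the average clip length and the share of raw footage that survives filtering. The constant and function names are illustrative only. Because the abstract says "more than 21,000 hours" and rounds the clip count to 2.7 million, both results are approximations, and the retention figure is an upper bound.

```python
# Figures quoted in the abstract; names are illustrative, not from the dataset.
RAW_HOURS = 21_000       # "more than 21,000 hours", so a lower bound on raw footage
CLIP_COUNT = 2_700_000   # "2.7 million clips" (rounded in the abstract)
CLIP_HOURS = 7_089       # total duration of the retained clips


def mean_clip_seconds(total_hours: float, clip_count: int) -> float:
    """Average clip duration in seconds."""
    return total_hours * 3600 / clip_count


def retention_upper_bound(kept_hours: float, raw_hours: float) -> float:
    """Fraction of raw footage kept; an upper bound when raw_hours is a lower bound."""
    return kept_hours / raw_hours


avg = mean_clip_seconds(CLIP_HOURS, CLIP_COUNT)
kept = retention_upper_bound(CLIP_HOURS, RAW_HOURS)

print(f"mean clip length: about {avg:.1f} s")            # about 9.5 s
print(f"share of raw footage kept: at most {kept:.1%}")  # at most 33.8%
```

In other words, the filtering pipeline keeps roughly a third of the collected footage, in clips averaging a little under ten seconds.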

Demonstration

[Sample 1: original video, depth map, camera pose]

Scene Abstract: A sunlit Italian courtyard surrounded by stone buildings, with climbing plants, flower pots, and a staircase leading upward, evoking a peaceful, timeless atmosphere.

Scene Description: The scene depicts a quaint, sunlit courtyard nestled within a historic Italian village. Stone buildings with shuttered windows enclose the space, their walls adorned with climbing plants and colorful flower pots. A staircase, also decorated with potted plants, leads to an upper level. The courtyard is paved with light-colored stones, reflecting the bright sunlight. The overall atmosphere is peaceful and charming, evoking a sense of timeless beauty and tranquility.

Immersive Shot Summary: The camera moves smoothly forward through a shadowed alley, its path tilting slightly downward as it weaves between weathered stone walls. As it emerges into a sun-drenched courtyard, the scene unfolds—stone steps lined with blooming flowers, soft light dancing on ancient stonework, and a quiet, timeless charm enveloping the space.

Camera Description: The camera glides steadily forward, moving through a narrow passage as it translates rightward and slightly downward. The motion remains smooth and consistent, gradually revealing an open courtyard with stone steps and potted plants. The forward movement slows as the camera reaches the space, offering a wide view of the historic setting.

[Sample 2: original video, depth map, camera pose]

Scene Abstract: A modern, sunlit living area features high ceilings, wooden beams, and sleek furniture, evoking a sense of comfort and sophistication.

Scene Description: The scene depicts a spacious, modern living area with high ceilings and exposed wooden beams. Large windows offer a view of an outdoor patio and landscape. Two white sofas face each other, flanking a light-colored coffee table. Dark leather armchairs sit near a dining area with a long table and chairs. A kitchen island with bar stools is visible in the background. The room is brightly lit, creating a warm and inviting atmosphere. The overall tone is luxurious and comfortable.

Immersive Shot Summary: The camera drifts forward through the airy, sun-drenched room, gliding past sleek sofas and a polished coffee table. As it moves left, the expansive space unfolds, highlighting the elegant design and warm ambiance of the luxurious living area.

Camera Description: The camera glides smoothly forward, gradually revealing the vast interior space. It then shifts left, sweeping across the open living area. The motion slows as it stabilizes, capturing the room’s luxurious details before subtly moving closer to the scene.

[Sample 3: original video, depth map, camera pose]

Scene Abstract: A towering mountain range stretches beneath a brooding sky, its jagged peaks shrouded in swirling clouds, evoking a sense of quiet majesty and natural grandeur.

Scene Description: The scene showcases a dramatic mountain landscape with jagged peaks and steep cliffs. A thick layer of clouds fills the valleys below, creating a sense of depth and mystery. The rocky terrain is sparsely covered with patches of green vegetation. The lighting is soft and diffused, suggesting an overcast day. The overall tone is awe-inspiring and serene, emphasizing the grandeur and scale of nature. The scene evokes a feeling of tranquility and invites contemplation.

Immersive Shot Summary: The camera surges forward through the air, descending along the rocky ridge as mist curls below. A gentle shift to the right reveals the sheer drop of the valley, the soft light casting long shadows across the barren slopes, capturing the raw beauty of the untamed landscape.

Camera Description: The camera glides forward and downward, tracing a dynamic path along the mountain ridge. It shifts slightly right as it descends, revealing the steep drop below. The motion is smooth and continuous, with a steady acceleration that emphasizes the vastness of the landscape.

[Sample 4: original video, depth map, camera pose]

Scene Abstract: A rainy night in Seoul features glistening streets, neon reflections, and bustling yet calm pedestrian activity amid a blend of traditional and modern architecture.

Scene Description: A rainy night in Seoul, South Korea, is depicted with glistening streets reflecting neon lights. Pedestrians with umbrellas walk along the sidewalk. The storefronts are brightly lit, displaying clothing and currency exchange services. The atmosphere is calm and subdued, with the rain creating a reflective surface on the pavement. The scene conveys a sense of urban life continuing despite the weather, with a mix of traditional and modern elements in the architecture and signage.

Immersive Shot Summary: The camera smoothly drifts right along a rain-slicked street, its path illuminated by glowing neon signs. Pedestrians with umbrellas move past brightly lit shops, their reflections shimmering on the wet pavement as the scene pulses with quiet urban energy.

Camera Description: The camera glides steadily to the right, moving through a wet urban landscape at night. It maintains a consistent pace, revealing storefronts and pedestrians under umbrellas, before gradually coming to a halt.
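Each sample above carries the same four caption fields: Scene Abstract, Scene Description, Immersive Shot Summary, and Camera Description. A small record type is one way to hold them when working with the captions in code. The sketch below is purely illustrative: the class name `ClipCaption`, the snake_case field names, and the `validate` helper are assumptions made for this example, not the dataset's actual storage format, which this page does not specify.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClipCaption:
    """Hypothetical container for the four caption fields shown per sample."""
    scene_abstract: str
    scene_description: str
    immersive_shot_summary: str
    camera_description: str

    def validate(self) -> None:
        """Raise ValueError if any caption field is empty or whitespace."""
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"empty caption field: {name}")


# Excerpts (first sentences) of the Sample 4 captions shown above.
sample_4 = ClipCaption(
    scene_abstract=(
        "A rainy night in Seoul features glistening streets, neon reflections, "
        "and bustling yet calm pedestrian activity amid a blend of traditional "
        "and modern architecture."
    ),
    scene_description=(
        "A rainy night in Seoul, South Korea, is depicted with glistening "
        "streets reflecting neon lights."
    ),
    immersive_shot_summary=(
        "The camera smoothly drifts right along a rain-slicked street, its path "
        "illuminated by glowing neon signs."
    ),
    camera_description=(
        "The camera glides steadily to the right, moving through a wet urban "
        "landscape at night."
    ),
)

sample_4.validate()
serialized = json.dumps(asdict(sample_4), indent=2)
print(serialized)
```

Keeping the scene-level fields separate from the camera-level fields makes it easy to select, for example, only the camera description when pairing text with pose annotations.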

Dataset Statistics

Curation Pipeline

For more details about the dataset curation pipeline, please refer to our code on GitHub.

License of SpatialVID

SpatialVID is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA-4.0). Users must attribute the original source, use the resource only for non-commercial purposes, and release any modified/derived works under the same license. For the full license text, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

Citation

If you find this project useful for your research, please cite our paper.

@article{wang2025spatialvid,
  title={SpatialVID: A Large-Scale Video Dataset with Spatial Annotations},
  author={Jiahao Wang and Yufeng Yuan and Rujie Zheng and Youtian Lin and Jian Gao and Lin-Zhuo Chen and Yajie Bao and Chang Zeng and Yanxi Zhou and Yi Zhang and Xiaoxiao Long and Hao Zhu and Zhaoxiang Zhang and Xun Cao and Yao Yao},
  journal={arXiv},
  year={2025}
}
