Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
Abstract
A pre-trained video diffusion model is repurposed as a latent world simulator, equipping multimodal large language models with implicit 3D structural and physical priors through spatiotemporal feature extraction and token-level gated fusion with semantic representations.
While Multimodal Large Language Models (MLLMs) demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial priors within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
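The token-level adaptive gated fusion described above can be pictured with a short sketch. The PyTorch snippet below is illustrative only: module names, layer choices, and dimensions are hypothetical assumptions, not the released VEGA-3D implementation.

```python
# Minimal sketch of a token-level adaptive gated fusion between MLLM semantic
# tokens and diffusion-derived spatiotemporal features. All names and sizes
# are illustrative, not the authors' actual code.
import torch
import torch.nn as nn

class TokenGatedFusion(nn.Module):
    """Fuse per-token semantic features with projected generative features."""
    def __init__(self, sem_dim: int, gen_dim: int):
        super().__init__()
        self.proj = nn.Linear(gen_dim, sem_dim)   # align diffusion features to the MLLM width
        self.gate = nn.Sequential(                # per-token scalar gate in [0, 1]
            nn.Linear(2 * sem_dim, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem_tokens: torch.Tensor, gen_tokens: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N, sem_dim) semantic visual tokens for the MLLM
        # gen_tokens: (B, N, gen_dim) spatiotemporal features from the video diffusion model
        gen_proj = self.proj(gen_tokens)
        g = self.gate(torch.cat([sem_tokens, gen_proj], dim=-1))  # (B, N, 1)
        return sem_tokens + g * gen_proj          # gate decides how much geometry each token absorbs

# Example with hypothetical sizes: 256 tokens, MLLM width 1024, diffusion feature width 1280
fusion = TokenGatedFusion(sem_dim=1024, gen_dim=1280)
fused = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 1280))
print(fused.shape)  # torch.Size([2, 256, 1024])
```

A per-token sigmoid gate of this kind lets the model draw on generative geometry cues only where they help, leaving language-aligned tokens largely untouched elsewhere.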
Community
Heartfelt thanks to Xianjin Wu for his absolutely phenomenal and groundbreaking contribution!
This paper is nothing short of spectacular — the quality, depth, and creativity are simply off the charts. An incredible achievement.
the trick of turning a frozen video diffusion model into a latent world simulator to boost MLLMs is a neat bridge between generative priors and discriminative grounding. it vibes with DreamFusion-style ideas, but VEGA-3D keeps the generator frozen and relies on mid-denoise features to inject geometry without explicit 3D supervision. the real juice, imho, is the token-level adaptive gated fusion that marries semantic representations with these generative priors, letting geometry cues steer downstream grounding while preserving language understanding. btw the arxivlens breakdown helped me parse the method details, and they do a nice job unpacking the role of intermediate features and layer selection (https://arxivlens.com/PaperView/Details/generation-models-know-space-unleashing-implicit-3d-priors-for-scene-understanding-4939-bc0a17ca). it would be interesting to see ablations against explicit 3D supervision in occluded or cluttered scenes to pin down when latent priors actually carry the load.
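for anyone curious, here is roughly what grabbing mid-denoise features from a frozen video diffusion backbone could look like in plain pytorch. the `video_unet` call signature, the noising step, and the layer choice are placeholders, not the paper's actual code.

```python
# Sketch: capture intermediate activations from one denoising forward pass of a
# frozen video diffusion backbone via a forward hook. Purely illustrative.
import torch

@torch.no_grad()
def extract_mid_denoise_features(video_unet, latents, timestep, layer):
    """Run one denoising pass and return the activations of `layer`."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["feat"] = output

    handle = layer.register_forward_hook(hook)
    noise = torch.randn_like(latents)
    # Simplified noising; real schedules use per-timestep alphas/sigmas.
    noisy = latents + 0.5 * noise
    video_unet(noisy, timestep)      # hypothetical signature: (latents, t)
    handle.remove()
    return captured["feat"]          # spatiotemporal features to hand to the fusion module
```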
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning (2026)
- Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models (2026)
- S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight (2026)
- PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment (2026)
- Cog2Gen3D: Sculpturing 3D Semantic-Geometric Cognition for 3D Generation (2026)
- VG3S: Visual Geometry Grounded Gaussian Splatting for Semantic Occupancy Prediction (2026)
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend