WoW: Towards a World-Omniscient World Model Through Embodied Interaction
Abstract
WoW, a 14-billion-parameter generative world model trained on robot interaction data, demonstrates improved physical intuition under SOPHIA's guidance and achieves state-of-the-art performance on WoWBench, a new benchmark for physical consistency and causal reasoning in video.
Humans develop intuitive physics through active interaction with the world, in stark contrast to current video models, such as Sora, which learn from passive observation and therefore struggle to grasp physical causality. This observation motivates our central hypothesis: a world model's authentic physical intuition must be grounded in extensive, causally rich interaction with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution over plausible outcomes, which gives rise to stochastic instabilities and physical hallucinations. We further demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, in which vision-language-model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. A co-trained Inverse Dynamics Model then translates these refined plans into executable robotic actions, closing the imagination-to-action loop. Finally, we establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, on which WoW achieves state-of-the-art performance under both human and autonomous evaluation, demonstrating strong capabilities in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone of physical intuition in AI. Models, data, and benchmarks will be open-sourced.
Community
How can an AI truly know our world? The philosopher Jean Piaget argued, "To know an object is to act on it". Is it enough for an AI to passively watch endless videos—observing mere shadows on a cave wall—or must it step into the world and learn by doing?
We argue for the latter. Current video models master the appearance of reality, but their grasp of causality remains brittle because they are only observers.
This philosophy is the foundation of our work, WoW (World-Omniscient World-Model)—a model that constructs its understanding of physics not from passive observation, but from embodied experience.
Here is what makes WoW a fundamental shift:
- Learning Grounded in Action: We trained our 14B-parameter model on an unprecedented 2 million real-world robot interaction trajectories, forcing it to learn physics by acting directly on the world. This principle of learning through a perception-action loop is central to our cognitive architecture.
- A Cognitive Loop for Self-Correction: WoW engages in cognitive self-reflection through our SOPHIA framework. It first imagines a future by generating a video; an integrated VLM-based critic then judges whether this imagined future is physically plausible, letting the model reflect on and refine its own predictions in a closed loop (see the first sketch after this list).
- Closing the Imagination-to-Action Loop: WoW translates its internal imaginings into physical reality. Using a novel Inverse Dynamics Model, it converts its refined, physically grounded imaginations directly into executable actions for a physical robot, grounding abstract knowledge in tangible consequences (see the second sketch after this list).
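Conceptually, the SOPHIA loop described in the second bullet can be sketched as a short generate-critique-refine routine. The code below is a minimal illustration under assumed interfaces; the objects and method names (`world_model.generate`, `vlm_critic.evaluate`, `vlm_critic.rewrite_instruction`) are hypothetical placeholders, not the released WoW API.

```python
# Minimal sketch of a SOPHIA-style generate-critique-refine loop.
# All objects and method names below are hypothetical placeholders.

def sophia_loop(world_model, vlm_critic, instruction, context_frames, max_rounds=3):
    """Iteratively evolve the language instruction until the imagined
    rollout is judged physically plausible by the VLM critic."""
    video = None
    for _ in range(max_rounds):
        # 1. Imagine: the DiT world model generates a candidate future video.
        video = world_model.generate(instruction, context_frames)

        # 2. Reflect: a VLM critic scores physical plausibility and explains failures.
        verdict = vlm_critic.evaluate(video, instruction)
        if verdict.plausible:
            return video, instruction

        # 3. Refine: rewrite the instruction using the critic's feedback.
        instruction = vlm_critic.rewrite_instruction(instruction, verdict.feedback)

    # Fall back to the last attempt if no round passed the plausibility check.
    return video, instruction
```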
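A toy inverse dynamics model, which maps pairs of consecutive (imagined) frames to low-level robot actions, might look like the following. The architecture, feature dimension, and 7-DoF action space are assumptions chosen for exposition, not the configuration used in WoW.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Toy IDM: regress the action that connects two consecutive frame embeddings."""

    def __init__(self, frame_dim=512, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g. 6-DoF end-effector delta + gripper
        )

    def forward(self, frame_t, frame_t1):
        # Concatenate the embeddings of consecutive frames and predict the action.
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

# Sliding the IDM over an imagined video yields an executable action sequence:
#   actions = [idm(f_t, f_t1) for f_t, f_t1 in zip(frames[:-1], frames[1:])]
```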
We believe WoW is more than a next-generation video model. It is a blueprint for a new class of AI that moves beyond statistical pattern matching toward a verifiable, causal understanding of reality. It represents the nascent form of a true world model: one that learns, reasons, reflects, and ultimately, acts.
We have also open-sourced WoWBench, our new benchmark for evaluating physical and causal reasoning in world models, to facilitate community progress. We invite you to explore our work and join the conversation on the future of embodied cognition.
The following similar papers were recommended by the Semantic Scholar API:
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation (2025)
- Learning Primitive Embodied World Models: Towards Scalable Robotic Learning (2025)
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions (2025)
- From Watch to Imagine: Steering Long-horizon Manipulation via Human Demonstration and Future Envisionment (2025)
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation (2025)
- PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models (2025)
- Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning (2025)