arXiv:2509.22642

WoW: Towards a World omniscient World model Through Embodied Interaction

Published on Sep 26 · Submitted by yong on Sep 29
Authors: Hao Li, et al.

AI-generated summary

WoW, a 14-billion-parameter generative world model trained on robot interactions, demonstrates improved physical intuition through SOPHIA's guidance and achieves state-of-the-art performance on physical consistency and causal reasoning in video.

Abstract

Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle to grasp physical causality. This observation leads to our central hypothesis: a world model's authentic physical intuition must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution over plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, in which vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, on which WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong capabilities in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.

Community

Paper submitter

How can an AI truly know our world? The philosopher Jean Piaget argued, "To know an object is to act on it". Is it enough for an AI to passively watch endless videos—observing mere shadows on a cave wall—or must it step into the world and learn by doing?

We argue for the latter. Current video models master the appearance of reality, but their grasp of causality remains brittle because they are only observers.

This philosophy is the foundation of our work, WoW (World-Omniscient World-Model)—a model that constructs its understanding of physics not from passive observation, but from embodied experience.

Here is what makes WoW a fundamental shift:

  • Learning Grounded in Action: We trained our 14B-parameter model on an unprecedented 2 million real-world robot interaction trajectories, forcing it to learn physics by directly acting upon the world. This principle of learning through a perception-action loop is central to our cognitive architecture.

  • A Cognitive Loop for Self-Correction: WoW engages in cognitive self-reflection through our SOPHIA framework. It first imagines a future by generating a video. An integrated VLM-based critic then evaluates whether this imagined future is physically plausible, allowing the model to reflect on and refine its own understanding in a closed loop (a minimal sketch of this loop follows the list).

  • Closing the Imagination-to-Action Loop: WoW translates its internal thoughts into physical reality. Using a novel Inverse Dynamics Model, it converts its refined, physically grounded imaginations directly into executable actions for a physical robot (a hedged sketch of this step also follows the list). This grounds abstract knowledge in tangible consequences.
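
To make the critique-and-refine cycle concrete, here is a minimal Python sketch of a SOPHIA-style loop. Everything in it is an illustrative assumption: the function names (`generate_video`, `critique`, `refine_instruction`), the plausibility threshold, and the round budget stand in for interfaces the page does not specify.

```python
from typing import Any, Callable, Tuple

def sophia_refine(
    instruction: str,
    generate_video: Callable[[str], Any],               # world model generator (assumed interface)
    critique: Callable[[Any, str], Tuple[float, str]],  # VLM critic: (plausibility score, feedback)
    refine_instruction: Callable[[str, str], str],      # evolves the language instruction
    max_rounds: int = 5,
    threshold: float = 0.9,
) -> Tuple[Any, str]:
    """Imagine a future, score its physical plausibility with a VLM critic,
    and evolve the instruction until the imagination passes or the budget ends."""
    video = generate_video(instruction)
    for _ in range(max_rounds):
        score, feedback = critique(video, instruction)
        if score >= threshold:          # imagined future judged physically plausible
            break
        instruction = refine_instruction(instruction, feedback)
        video = generate_video(instruction)  # re-imagine with the refined instruction
    return video, instruction
```

In the paper's terms, `generate_video` stands in for the DiT generator and `critique` for the VLM agents; the loop stops on a plausible imagination or after a fixed refinement budget.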

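The final bullet's inverse-dynamics step might look roughly like the following PyTorch sketch. The per-frame feature dimension and the 7-DoF action space are illustrative guesses, not details taken from the paper.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Hypothetical inverse dynamics head: given features of two consecutive
    imagined frames, predict the robot action that connects them."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),  # concatenated frame-pair features
            nn.ReLU(),
            nn.Linear(256, action_dim),    # e.g. a 7-DoF end-effector command
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

def video_to_actions(idm: InverseDynamicsModel, frame_feats: torch.Tensor) -> torch.Tensor:
    """Decode an imagined video, given as (T, feat_dim) frame features,
    into a (T - 1, action_dim) sequence of executable actions."""
    return torch.stack(
        [idm(frame_feats[t], frame_feats[t + 1]) for t in range(len(frame_feats) - 1)]
    )
```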

We believe WoW is more than a next-generation video model. It is a blueprint for a new class of AI that moves beyond statistical pattern matching toward a verifiable, causal understanding of reality. It represents the nascent form of a true world model: one that learns, reasons, reflects, and, ultimately, acts.

We have also open-sourced WoWBench, our new benchmark for evaluating physical and causal reasoning in world models, to facilitate community progress. We invite you to explore our work and join the conversation on the future of embodied cognition.
