# 🤗 WoW-1-Wan-14B-2M
WoW-1-Wan-14B is a 14-billion-parameter generative world model trained on 2 million real-world robot interaction trajectories. It is designed to imagine, reason, and act in physically consistent environments, powered by SOPHIA-guided refinement and a co-trained Inverse Dynamics Model.
This model is part of the **WoW (World-Omniscient World Model)** project, introduced in the paper:

> **WoW: Towards a World omniscient World model Through Embodied Interaction**
> Chi et al., 2025 · [arXiv:2509.22642](https://arxiv.org/abs/2509.22642)
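Since the backbone is a Wan-family 14B text-to-video model, a plausible way to try it is through 🤗 Diffusers' Wan pipeline. Below is a minimal sketch, assuming the repository ships Diffusers-format weights under the repo id shown (an assumption, not confirmed by this card) and that the standard Wan 2.1 loading path applies:

```python
# Minimal text-to-video sketch using the Wan pipeline from diffusers.
# The repo id and generation settings are illustrative assumptions;
# check the project page for the officially supported loading path.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "wow-world-model/WoW-1-Wan-14B-2M"  # assumed repo id

# The Wan VAE is typically kept in float32 for numerical stability.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "The Franka robot grasps the red bottle on the table."
frames = pipe(prompt=prompt, num_frames=81, guidance_scale=5.0).frames[0]
export_to_video(frames, "franka_grasp.mp4", fps=16)
```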
## 🧠 Key Features
- 14B parameters trained on 2M robot interaction samples
- Learns causal physical reasoning from embodied action
- Generates physically consistent video and robotic action plans
- Uses SOPHIA, a vision-language critic, to refine outputs
- Paired with a co-trained Inverse Dynamics Model (IDM) to close the imagination-to-action loop (sketched below)
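How these pieces fit together: the world model imagines a video rollout from an instruction, SOPHIA critiques the rollout and refines the prompt until it is physically plausible, and the IDM translates the imagined frames into executable actions. A schematic sketch with assumed interfaces (none of these classes are a published API):

```python
# Schematic sketch of the imagination-to-action loop. All class and
# method names are assumed interfaces for illustration only; the
# released code may differ.
def imagine_and_act(world_model, sophia, idm, instruction, max_refinements=3):
    """Turn a language instruction into a video rollout plus an action plan."""
    video = world_model.generate(instruction)           # imagined rollout
    for _ in range(max_refinements):
        feedback = sophia.evaluate(video, instruction)  # vision-language critic
        if feedback.physically_consistent:
            break
        # SOPHIA proposes a refined prompt; re-imagine the rollout.
        video = world_model.generate(feedback.refined_instruction)
    actions = idm.predict(video)                        # frames -> robot actions
    return video, actions
```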
## 🧪 Training Data
- 2M real-world robot interaction trajectories
- Multimodal scenes covering vision, action, and language
- A diverse mixture of caption styles for better generalization
## 🧠 Mixture Caption Strategy

**Prompt lengths:**
- Short: "The Franka robot, grasp the red bottle on the table"
- Long: "The scene... open the drawer, take the screwdriver, place it on the table..."

**Robot model mixing:**
- Captions reference various robot types
- Example: "grasp with the Franka Panda arm", "use end-effector to align"

**Action granularity:**
- Coarse: "move to object"
- Fine: "rotate wrist 30° before grasping"
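To make the mixing concrete, here is a hypothetical caption sampler in the spirit of this strategy. The templates, robot names, and sampling weights are illustrative assumptions, not the actual training recipe:

```python
# Hypothetical sketch of a mixture-caption sampler. Templates, robot
# names, and probabilities are illustrative; the real recipe is not
# published in this card.
import random

ROBOTS = ["Franka Panda arm", "UR5e arm", "xArm 7"]  # assumed robot mix
GRANULARITIES = {
    "coarse": "move to the {obj} and grasp it",
    "fine": "rotate the wrist 30 degrees, then grasp the {obj}",
}

def sample_caption(obj: str) -> str:
    robot = random.choice(ROBOTS)
    template = GRANULARITIES[random.choice(list(GRANULARITIES))]
    action = template.format(obj=obj)
    if random.random() < 0.5:
        # Short prompt: robot plus a single action.
        return f"The {robot}: {action}."
    # Long prompt: scene context plus a multi-step instruction.
    return (f"The scene shows a cluttered tabletop. Using the {robot}, "
            f"{action}, then place it on the table.")

print(sample_caption("red bottle"))
```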
## 🔄 Continuous Updates
The training dataset will be continuously updated with:
- More trajectories
- Richer language
- Finer multimodal annotations
## 🧩 Applications
- Zero-shot video generation in robotics
- Causal reasoning and physics simulation
- Long-horizon manipulation planning
- Forward and inverse control prediction
## 📖 Citation
```bibtex
@article{chi2025wow,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Chi, Xiaowei and Jia, Peidong and Fan, Chun-Kai and Ju, Xiaozhu and Mi, Weishi and Qin, Zhiyuan and Zhang, Kevin and Tian, Wanxin and Ge, Kuangzhi and Li, Hao and others},
  journal={arXiv preprint arXiv:2509.22642},
  year={2025}
}
```
## 🔗 Resources
- 🌐 Project page: [wow-world-model.github.io](https://wow-world-model.github.io)
- 💻 GitHub repo: [wow-world-model/wow-world-model](https://github.com/wow-world-model/wow-world-model)
- 📊 Dataset: WoW-1 Benchmark Samples