
🤖 WoW-1-Wan-14B-2M

WoW-1-Wan-14B is a 14-billion-parameter generative world model trained on 2 million real-world robot interaction trajectories. It is designed to imagine, reason, and act in physically consistent environments, powered by SOPHIA-guided refinement and a co-trained Inverse Dynamics Model.

This model is part of the WoW (World-Omniscient World Model) project, introduced in the paper:

WoW: Towards a World omniscient World model Through Embodied Interaction
Chi et al., 2025 – arXiv:2509.22642

🧠 Key Features

  • 14B parameters trained on 2M robot interaction samples
  • Learns causal physical reasoning from embodied action
  • Generates physically consistent video and robotic action plans
  • Uses SOPHIA, a vision-language critic, to refine outputs
  • Paired with an Inverse Dynamics Model to close the imagination-to-action loop
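
This card ships no usage snippet. The sketch below shows one plausible way to run text-to-video inference, assuming the checkpoint is published in the Wan2.1 diffusers layout; the repo id, resolution, and sampling settings are all assumptions, not confirmed by this card.

```python
# Minimal text-to-video sketch. Assumes the checkpoint follows the Wan2.1
# diffusers layout; the repo id and settings below are assumptions.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "WoW-world-model/WoW-1-Wan-14B-2M"  # hypothetical repo id
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="The Franka robot, grasp the red bottle on the table",
    height=480,
    width=832,
    num_frames=81,       # ~5 s at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wow_rollout.mp4", fps=16)
```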

🧪 Training Data

  • 2M real-world robot interaction trajectories
  • Multimodal scenes including vision, action, and language
  • A diverse mixture of caption styles for better generalization
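
The card does not publish a schema, but the modalities listed above suggest a record layout roughly like the hypothetical sketch below; all field names and shapes are illustrative, not the dataset's actual format.

```python
# Hypothetical layout of a single trajectory record, inferred from the
# modalities this card lists (vision, action, language); field names and
# shapes are illustrative only.
from dataclasses import dataclass
import numpy as np

@dataclass
class TrajectoryStep:
    rgb: np.ndarray     # (H, W, 3) camera frame
    action: np.ndarray  # e.g. 7-DoF end-effector command
    caption: str        # language instruction for this segment

@dataclass
class Trajectory:
    robot: str          # e.g. "franka_panda"
    steps: list[TrajectoryStep]
```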

🧠 Mixture Caption Strategy

  • Prompt Lengths:
    • Short: "The Franka robot, grasp the red bottle on the table"
    • Long: "The scene... open the drawer, take the screwdriver, place it on the table..."
  • Robot Model Mixing:
    • Captions reference various robot types
    • Example: "grasp with the Franka Panda arm", "use end-effector to align"
  • Action Granularity:
    • Coarse: "move to object"
    • Fine: "rotate wrist 30° before grasping"
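
As a rough illustration of this strategy, the sketch below samples a caption by mixing prompt length, robot model, and action granularity. The templates, robot names, and sampling weights are illustrative assumptions, not the paper's actual recipe.

```python
# Toy caption-mixing sampler. Templates and sampling weights are illustrative
# assumptions; the paper's actual mixture recipe is not reproduced here.
import random

SHORT = "The {robot}, {verb} the {obj} on the table"
LONG = ("The scene shows a cluttered tabletop. The {robot} should {verb} "
        "the {obj}, then place it on the table")
ROBOTS = ["Franka Panda arm", "UR5 arm", "xArm"]
COARSE = ["move to", "grasp"]
FINE = ["rotate the wrist 30 degrees before grasping",
        "align the end-effector with"]

def sample_caption(obj: str) -> str:
    robot = random.choice(ROBOTS)                        # robot-model mixing
    verb = random.choice(COARSE + FINE)                  # granularity mixing
    template = SHORT if random.random() < 0.5 else LONG  # prompt-length mixing
    return template.format(robot=robot, verb=verb, obj=obj)

print(sample_caption("red bottle"))
```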

🔄 Continuous Updates

The training dataset will be continuously updated with:

  • More trajectories
  • Richer language
  • Finer multimodal annotations

🧩 Applications

  • Zero-shot video generation in robotics
  • Causal reasoning and physics simulation
  • Long-horizon manipulation planning
  • Forward and inverse control prediction
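
The card gives no architecture details for the co-trained Inverse Dynamics Model; the sketch below shows only the generic pattern behind inverse control prediction, regressing the action that connects two consecutive frames. All layer choices, shapes, and sizes are assumptions.

```python
# Generic inverse-dynamics pattern: predict the action that connects two
# consecutive observations. Architecture, shapes, and sizes are assumptions;
# the paper's actual IDM is not described in this card.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(   # shared per-frame encoder
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(      # action regressor on paired features
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.encoder(frame_t), self.encoder(frame_t1)], dim=-1)
        return self.head(z)

idm = InverseDynamicsModel()
action = idm(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))  # (1, 7)
```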

📄 Citation

@article{chi2025wow,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Chi, Xiaowei and Jia, Peidong and Fan, Chun-Kai and Ju, Xiaozhu and Mi, Weishi and Qin, Zhiyuan and Zhang, Kevin and Tian, Wanxin and Ge, Kuangzhi and Li, Hao and others},
  journal={arXiv preprint arXiv:2509.22642},
  year={2025}
}
