ABot-PhysWorld: Interactive World Foundation Model for Robotic Manipulation with Physics Alignment
Abstract
ABot-PhysWorld is a 14B Diffusion Transformer model that generates physically plausible videos through physics-aware training and evaluation on a new benchmark.
Video-based world models offer a powerful paradigm for embodied simulation and planning, yet state-of-the-art models often generate physically implausible manipulations, such as object penetration and anti-gravity motion, because they are trained on generic visual data with likelihood-based objectives that ignore physical laws. We present ABot-PhysWorld, a 14B Diffusion Transformer that generates visually realistic, physically plausible, and action-controllable videos. Built on a curated dataset of three million manipulation clips with physics-aware annotations, it uses a novel DPO-based post-training framework with decoupled discriminators to suppress unphysical behaviors while preserving visual quality. A parallel context block enables precise spatial action injection for cross-embodiment control. To better evaluate generalization, we introduce EZSbench, the first training-independent embodied zero-shot benchmark combining real and synthetic unseen robot-task-scene combinations; it employs a decoupled protocol to assess physical realism and action alignment separately. ABot-PhysWorld achieves new state-of-the-art performance on PBench and EZSbench, surpassing Veo 3.1 and Sora v2 Pro in physical plausibility and trajectory consistency. We will release EZSbench to promote standardized evaluation in embodied video generation.
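The abstract does not spell out the DPO objective, but the general shape of preference-based post-training for diffusion models is well established. Below is a minimal, hedged sketch of a Diffusion-DPO-style loss as it might apply here: pairs of generated clips are labeled "preferred" (physically plausible) vs. "rejected" (e.g. penetration or anti-gravity motion, as the paper's decoupled discriminators would flag), and the policy model is pushed to denoise the preferred clip better than a frozen reference model does. All function and variable names are hypothetical illustrations, not the paper's actual implementation.

```python
import math

def diffusion_dpo_loss(policy_err_w, policy_err_l,
                       ref_err_w, ref_err_l, beta=0.1):
    """Hypothetical Diffusion-DPO-style preference loss.

    Inputs are per-sample denoising errors (e.g. MSE between predicted
    and true noise); lower is better.
      *_w : error on the preferred (physically plausible) clip
      *_l : error on the rejected (unphysical) clip
      ref_* come from a frozen reference copy of the model.
    """
    # Implicit log preference ratio: how much more the policy improves
    # on the preferred clip than on the rejected one, measured relative
    # to the reference model so visual quality is not degraded.
    diff = (ref_err_w - policy_err_w) - (ref_err_l - policy_err_l)
    # Logistic (sigmoid cross-entropy) loss on that margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * diff)))
```

With equal errors everywhere the loss sits at `-log(0.5)`; it decreases as the policy denoises the plausible clip increasingly better than the implausible one, which is the mechanism by which unphysical behaviors are suppressed.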
Community
The following related papers were recommended by the Semantic Scholar API:
- Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation (2026)
- Egocentric World Model for Photorealistic Hand-Object Interaction Synthesis (2026)
- EVA: Aligning Video World Models with Executable Robot Actions via Inverse Dynamics Rewards (2026)
- BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks (2026)
- PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment (2026)
- ImagiNav: Scalable Embodied Navigation via Generative Visual Prediction and Inverse Dynamics (2026)
- EgoForge: Goal-Directed Egocentric World Simulator (2026)