CoInteract
Physically-Consistent Human-Object Interaction Video Synthesis
via Spatially-Structured Co-Generation
Alibaba Group & Tsinghua University
Highlights
CoInteract is the first end-to-end framework that generates physically-consistent human-object interaction (HOI) videos with zero additional inference cost. Given a person image, a product image, text prompts, and optional speech audio, CoInteract produces realistic videos where humans naturally grasp, wear, present, and manipulate objects, with no hand-object interpenetration or geometric misalignment.
Key Results:
- State-of-the-art on HOI video synthesis benchmarks
- Physically plausible hand-object contact (significantly reduced interpenetration)
- Zero inference overhead: the auxiliary HOI branch is removed at test time
- Supports diverse interactions: grasping, wearing, presenting, carrying, and more
- Real-world ready: Virtual Try-On, Digital Human Commerce, Physics Simulation
Architecture
CoInteract embeds structural priors and interaction geometry directly into a Diffusion Transformer (DiT) backbone through three core innovations:
Human-Aware MoE: a spatially-supervised Mixture-of-Experts routes tokens to region-specialized experts (Head, Hand, Base), ensuring high structural fidelity for hands and faces with minimal parameter overhead.
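The region-based routing idea can be sketched as follows. This is an illustrative reading only, not the released implementation: the token-to-expert assignment via a region map, the linear experts, and all names and shapes below are assumptions made for the sketch.

```python
import numpy as np

# Hypothetical sketch of spatially-supervised MoE routing: each spatial token
# carries a region id (e.g. from a human-parsing map) and is processed by the
# matching region-specialized expert. Real experts would be learned MLPs; here
# each expert is a fixed linear map to keep the example minimal.

EXPERTS = ("head", "hand", "base")

def route_tokens(tokens: np.ndarray, region_ids: np.ndarray,
                 expert_weights: dict) -> np.ndarray:
    """tokens: (N, D); region_ids: (N,) ints indexing EXPERTS;
    expert_weights: expert name -> (D, D) matrix. Returns (N, D)."""
    out = np.zeros_like(tokens)
    for idx, name in enumerate(EXPERTS):
        mask = region_ids == idx
        if mask.any():
            # Route the masked tokens through their region's expert.
            out[mask] = tokens[mask] @ expert_weights[name]
    return out

rng = np.random.default_rng(0)
D = 8
tokens = rng.standard_normal((6, D))
region_ids = np.array([0, 1, 1, 2, 2, 2])  # assumed region labels per token
weights = {name: rng.standard_normal((D, D)) for name in EXPERTS}
routed = route_tokens(tokens, region_ids, weights)
print(routed.shape)  # (6, 8)
```

Because routing is decided by the region map rather than a learned gate over all experts, only one small expert runs per token, which is one way the "minimal parameter overhead" property could be realized.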
Dual-Stream Co-Generation: an auxiliary HOI structure stream is jointly trained with the RGB stream within a shared DiT backbone, forcing the model to learn spatial and interaction relationships.
Asymmetric Co-Attention: a two-stage training strategy with asymmetric attention masks embeds physical interaction rules, enabling the HOI branch to be removed entirely at inference with zero overhead.
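One way such an asymmetric mask could make the HOI branch removable is if RGB queries never attend to HOI keys, while HOI queries may attend to both streams; dropping the HOI tokens then leaves the RGB attention pattern unchanged. The following mask construction is a plausible sketch of that reading, not the paper's actual masking scheme.

```python
import numpy as np

def asymmetric_mask(n_rgb: int, n_hoi: int) -> np.ndarray:
    """Boolean attention mask over the joint token sequence [RGB | HOI].
    True = attention allowed. RGB queries are blocked from HOI keys, so the
    HOI tokens can be dropped at inference without changing RGB outputs.
    (Illustrative assumption, not the confirmed CoInteract design.)"""
    n = n_rgb + n_hoi
    mask = np.ones((n, n), dtype=bool)
    mask[:n_rgb, n_rgb:] = False  # block RGB -> HOI attention
    return mask

m = asymmetric_mask(4, 2)
# RGB rows see only RGB columns; HOI rows see the full sequence.
print(m[:4, 4:].any(), m[4:, :].all())  # False True
```

Under this construction the HOI stream still shapes the shared backbone weights through joint training, which is consistent with the claim that the branch can be deleted at test time with zero overhead.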
Demo
Our model handles diverse real-world products across various scenarios. Visit our Project Page for full video demos.
Supported Interaction Types:
| Category | Examples |
|---|---|
| Grasping | Macaron box · Teapot · Skincare serum · Coffee mug |
| Presenting | Leather handbag · Eyeshadow palette · Decorative plate |
| Wearing | Emerald necklace · Sports jacket |
| Holding | Cactus pot · Various daily objects |
Application Scenarios:
- Digital Human Commerce: AI-powered digital humans presenting products in live-stream e-commerce
- Virtual Try-On: physically realistic garment and accessory interactions
- Physics Simulation: generating high-quality HOI training data for robotics
Model Details
| Property | Value |
|---|---|
| Backbone | Diffusion Transformer (DiT) |
| Resolution | 480p / 720p |
| Frame Count | Up to 81 frames per chunk |
| Multi-Chunk | Supported for long-form video |
| Inputs | Person image + Object image + Text prompt + Audio + (Optional) Pose |
| Training Data | Large-scale HOI video dataset with structure annotations |
| Precision | FP16 / BF16 |
| License | Apache 2.0 |
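The "up to 81 frames per chunk" and multi-chunk entries above suggest that long-form videos are generated as consecutive chunks. A minimal sketch of such chunking, assuming non-overlapping contiguous chunks (the actual pipeline may condition each chunk on the previous one):

```python
def split_into_chunks(num_frames: int, chunk_size: int = 81) -> list:
    """Split a long video into consecutive chunks of at most `chunk_size`
    frames. Returns a list of (start, end) frame ranges (end exclusive).
    Illustrative helper only; chunk handoff details are not specified here."""
    return [(s, min(s + chunk_size, num_frames))
            for s in range(0, num_frames, chunk_size)]

print(split_into_chunks(200))  # [(0, 81), (81, 162), (162, 200)]
```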
Quantitative Results
Full quantitative results will be released upon paper acceptance.
Citation
If you find CoInteract useful for your research, please consider citing:
```bibtex
@article{luo2025cointeract,
  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},
  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},
  journal={arXiv preprint arXiv:2604.19636},
  year={2026}
}
```
Acknowledgements
This work is supported by Taobao Live Tech, Alibaba Group and Tsinghua University.
Project Page · Paper · Code · Demo
If you like this project, please give us a star!
