arxiv:2604.19636

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Published on Apr 21 · Submitted by xuguo on Apr 22 · #2 Paper of the day
Abstract

AI-generated summary

CoInteract presents an end-to-end framework for human-object interaction video synthesis using a Diffusion Transformer backbone with specialized modules for structural stability and physical plausibility.

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
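To make the two designs concrete, here are two toy PyTorch sketches based only on the abstract. All module names, layer choices, and hyperparameters below are illustrative assumptions, not the paper's actual implementation.

First, the Human-Aware MoE: one plausible reading of "spatially supervised routing" is a linear router that assigns tokens to a few lightweight region experts, with the router itself trained against per-token region labels derived from spatial masks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HumanAwareMoE(nn.Module):
    """Toy token-level MoE. A linear router scores each token against a few
    lightweight region experts (e.g. hands / face / everything else); during
    training the router is additionally supervised with per-token region
    labels, one plausible reading of spatially supervised routing."""

    def __init__(self, dim: int, num_experts: int = 3, hidden: int = 256):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        # Small per-region MLPs keep the parameter overhead modest.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens, region_labels=None):
        # tokens: (B, N, D); region_labels: (B, N) int region ids, or None.
        logits = self.router(tokens)                    # (B, N, E)
        weights = logits.softmax(dim=-1)                # soft routing, for simplicity
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, D, E)
        mixed = (expert_out * weights.unsqueeze(2)).sum(dim=-1)              # (B, N, D)
        routing_loss = None
        if region_labels is not None:
            # Push router decisions toward the ground-truth region map.
            routing_loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), region_labels.reshape(-1)
            )
        return tokens + mixed, routing_loss

moe = HumanAwareMoE(dim=64)
x = torch.randn(2, 16, 64)                # 2 clips, 16 tokens each
labels = torch.randint(0, 3, (2, 16))     # per-token region ids from masks
out, aux_loss = moe(x, labels)
```

Second, Spatially-Structured Co-Generation: a minimal sketch of the one-way attention wiring, where the auxiliary HOI stream queries RGB tokens but the RGB stream never depends on the HOI branch, so dropping that branch at inference adds zero overhead.

```python
import torch
import torch.nn as nn

class CoGenBlock(nn.Module):
    """One dual-stream block. The RGB stream goes through a standard layer;
    the auxiliary HOI-structure stream cross-attends to the RGB tokens
    (queries = HOI, keys/values = RGB). The dependency is one-directional,
    so the HOI branch can simply be skipped at inference."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.rgb_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.hoi_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.hoi_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, hoi=None):
        rgb = self.rgb_layer(rgb)
        if hoi is not None:  # training only; pass hoi=None at inference
            attended, _ = self.hoi_to_rgb(hoi, rgb, rgb)
            hoi = self.hoi_layer(hoi + attended)
        return rgb, hoi

block = CoGenBlock(dim=64)
rgb, hoi = torch.randn(2, 16, 64), torch.randn(2, 16, 64)

# Training: both streams are supervised; the HOI loss backpropagates into
# rgb_layer through the cross-attention keys/values, which is how the
# auxiliary supervision can regularize the shared backbone weights.
rgb_out, hoi_out = block(rgb, hoi)

# Inference: the HOI branch is dropped entirely.
rgb_only, _ = block(rgb)
```

Note the design choice both sketches share: the auxiliary signal (routing labels, HOI structure) only adds losses and one-way attention, so the RGB generation path at inference is identical to a plain DiT forward pass.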

Community

Paper author · Paper submitter

We introduce CoInteract, a spatially-structured co-generation framework for speech-driven human-object interaction video synthesis.
Project Page: https://xinxiaozhe12345.github.io/CoInteract_Project/
Code: https://github.com/luoxyhappy/CoInteract

Impressive. I'll find time to read it in detail.

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/cointeract-physically-consistent-human-object-interaction-video-synthesis-via-spatially-structured-co-generation-466-2e94712f
Covers the executive summary, detailed methodology, and practical applications.


Get this paper in your agent:

hf papers read 2604.19636
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.19636 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2604.19636 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.