arXiv:2503.10678

VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion

Published on Mar 11, 2025

Abstract

We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models to generate alpha mattes that are temporally coherent and closely related to the corresponding semantic instances. Moreover, we propose a new Latent-Constructive loss to further distinguish different instances, enabling more controllable interactive matting. Additionally, we introduce a large-scale video referring matting dataset with 10,000 videos. To the best of our knowledge, this is the first dataset that concurrently contains captions, videos, and instance-level alpha mattes. Extensive experiments demonstrate the effectiveness of our method. The dataset and code are available at https://github.com/Hansxsourse/VRMDiff.

AI-generated summary

A new task, video referring matting, uses video diffusion models to generate temporally coherent alpha mattes from referring captions, with a Latent-Constructive loss for better instance distinction.

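The abstract names a Latent-Constructive loss for separating instances but does not give its form. As a rough illustration only, not the paper's definition, the sketch below shows an InfoNCE-style contrastive objective over pooled per-instance latents: each predicted matte latent is pulled toward its own instance's target latent and pushed away from the latents of other instances. The function name latent_separation_loss, the pooled (N, D) latent shapes, and the temperature value are assumptions made for this sketch.

```python
# Hypothetical sketch, not the paper's implementation: a contrastive-style
# objective that distinguishes instances in latent space.
import torch
import torch.nn.functional as F

def latent_separation_loss(pred_latents, target_latents, temperature=0.1):
    """
    pred_latents:   (N, D) pooled latent predicted for each referred instance
    target_latents: (N, D) pooled latent of that instance's ground-truth matte
    Returns an InfoNCE-style loss: each prediction should be similar to its
    own target and dissimilar from the targets of the other instances.
    """
    pred = F.normalize(pred_latents, dim=-1)
    tgt = F.normalize(target_latents, dim=-1)
    logits = pred @ tgt.t() / temperature          # (N, N) cosine similarities
    labels = torch.arange(pred.size(0), device=pred.device)  # diagonal is positive
    return F.cross_entropy(logits, labels)

# Usage with random tensors standing in for diffusion latents:
pred = torch.randn(4, 128)   # 4 instances, 128-dim pooled latents
tgt = torch.randn(4, 128)
print(latent_separation_loss(pred, tgt).item())
```

In the paper's setting, such latents would come from the video diffusion model's latent space rather than random tensors, and the actual loss may differ from this contrastive reading.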