title: TransPixelerTest
app_file: app.py
sdk: gradio
sdk_version: 5.35.0
TransPixeler: Advancing Text-to-Video Generation with Transparency (CVPR 2025)
Luozhou Wang*, Yijun Li**, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Ying-Cong Chen†
HKUST(GZ), HKUST, Adobe Research.
* Internship Project
** Project Lead
† Corresponding Author
Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes.
We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixeler leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixeler preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data.
Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
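As a small illustration of why the alpha channel matters for VFX, the sketch below blends an RGBA foreground over a background using the standard "over" compositing formula, `out = alpha * fg + (1 - alpha) * bg`. It is not part of the TransPixeler code; the function names `composite_pixel` and `composite_frame` are hypothetical, and it assumes straight (non-premultiplied) alpha with frames stored as nested Python lists.

```python
def composite_pixel(fg_rgb, alpha, bg_rgb):
    """Blend one foreground pixel over a background pixel ("over" operator).

    fg_rgb, bg_rgb: (r, g, b) tuples with channel values in [0, 255].
    alpha: foreground opacity in [0.0, 1.0], assumed non-premultiplied.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return tuple(
        round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg_rgb, bg_rgb)
    )


def composite_frame(fg_frame, alpha_frame, bg_frame):
    """Apply composite_pixel to every pixel of equally sized 2-D frames."""
    return [
        [composite_pixel(f, a, b) for f, a, b in zip(fg_row, a_row, bg_row)]
        for fg_row, a_row, bg_row in zip(fg_frame, alpha_frame, bg_frame)
    ]
```

With `alpha = 0` the background shows through unchanged and with `alpha = 1` the foreground fully covers it; semi-transparent elements such as smoke fall in between, which is why RGB and alpha need to stay well aligned.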
📰 News
[2025.04.28] We have introduced a new development branch, wan, that integrates the Wan2.1 video generation model to support joint generation tasks. This branch includes training code tailored for generating both RGB and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.
[2025.02.26] TransPixeler is accepted by CVPR 2025! See you in Nashville!
[2025.01.19] We've renamed our project from TransPixar to TransPixeler!!
[2025.01.17] We've created a Discord group and a WeChat group! Everyone is welcome to join for discussions and collaborations.
[2025.01.14] Added new tasks to the repository's roadmap, including support for Hunyuan and LTX video models, and ComfyUI integration.
[2025.01.07] Released project page, arXiv paper, inference code, and Hugging Face demo.
🔥 New Branch for Joint Generation with Wan2.1
We have introduced a new development branch, wan, that integrates the Wan2.1 video generation model to support joint generation tasks.
In the wan branch, we have developed and released training code tailored for joint generation scenarios, enabling the simultaneous generation of RGB videos and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.
Key features of the wan branch:
- Integration of Wan2.1: Leverages the capabilities of the Wan2.1 video generation model for enhanced performance.
- Joint Generation Support: Facilitates the concurrent generation of RGB and paired modality videos.
- Dataset Structure: Expects each sample to include:
  - A primary video file (001.mp4) representing the RGB content.
  - A paired secondary video file (001_seg.mp4) with a fixed _seg suffix, representing the associated modality.
  - A caption text file (001.txt) with the same base name as the primary video.
- Periodic Evaluation: Supports periodic video sampling during training by setting eval_every_step or eval_every_epoch in the configuration.
- Customized Pipelines: Offers tailored training and inference pipelines designed specifically for joint generation tasks.
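The dataset layout described in the list above can be checked before training with a short script. The following is a minimal sketch, not code from the wan branch: the function name `find_incomplete_samples` is hypothetical, and it assumes all files of a sample sit together in one flat directory, which the README does not state explicitly.

```python
from pathlib import Path

SEG_SUFFIX = "_seg"  # fixed suffix of the paired modality video


def find_incomplete_samples(dataset_dir):
    """Return {base_name: [missing file names]} for samples missing files.

    Assumption: primary videos, paired videos, and captions share one
    flat directory (e.g. 001.mp4, 001_seg.mp4, 001.txt).
    """
    root = Path(dataset_dir)
    problems = {}
    for video in sorted(root.glob("*.mp4")):
        if video.stem.endswith(SEG_SUFFIX):
            continue  # paired videos are checked via their primary video
        missing = [
            name
            for name in (f"{video.stem}{SEG_SUFFIX}.mp4", f"{video.stem}.txt")
            if not (root / name).is_file()
        ]
        if missing:
            problems[video.stem] = missing
    return problems
```

An empty result means every primary video has both its paired video and its caption. The sketch does not flag a paired `_seg` video whose primary video is absent.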
To use the joint generation features, check out the wan branch.
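The Periodic Evaluation item in the feature list names two settings, eval_every_step and eval_every_epoch. One plausible reading of how such settings drive sampling is sketched below. The semantics here are an assumption and the training loop in the wan branch may behave differently; `should_sample` and its other parameters are hypothetical names introduced only for illustration.

```python
def should_sample(step, epoch, is_epoch_end,
                  eval_every_step=None, eval_every_epoch=None):
    """Decide whether to sample evaluation videos at this point in training.

    Assumed semantics (not confirmed by the repository):
    - step and epoch are 1-based counters;
    - None or 0 disables the corresponding trigger;
    - the epoch trigger only fires at the end of an epoch.
    """
    if eval_every_step and step % eval_every_step == 0:
        return True
    if eval_every_epoch and is_epoch_end and epoch % eval_every_epoch == 0:
        return True
    return False
```

Under this reading, setting `eval_every_step=500` would sample videos at steps 500, 1000, and so on, while `eval_every_epoch=2` would sample at the end of every second epoch.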
Installation
# For the main branch
conda create -n TransPixeler python=3.10
conda activate TransPixeler
pip install -r requirements.txt
Note:
If you want to use the Wan2.1 model, first check out the wan branch:
git checkout wan
TransPixeler LoRA Weights
Our pipeline is designed to support various video tasks, including Text-to-RGBA Video and Image-to-RGBA Video.
We provide the following pre-trained LoRA weights:
| Task | Base Model | Frames | LoRA weights | Inference VRAM |
|---|---|---|---|---|
| T2V + RGBA | THUDM/CogVideoX-5B | 49 | link | ~24 GB |
Training - RGB + Alpha Joint Generation
We have open-sourced the training code for Mochi on RGBA joint generation. Please refer to the Mochi README for details.
Inference - Gradio Demo
In addition to the Hugging Face online demo, users can also launch a local inference demo based on CogVideoX-5B by running the following command:
python app.py
Inference - Command Line Interface (CLI)
To generate RGBA videos, navigate to the corresponding directory for the video model and execute the following command:
python cli.py \
--lora_path /path/to/lora \
--prompt "..."
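To run the command above for several prompts in a row, a thin wrapper can be useful. The sketch below is an illustration only: it uses just the two flags shown in this README (`--lora_path` and `--prompt`), and the helper names `build_command` and `generate_all` are hypothetical. It assumes the script is run from the directory that contains `cli.py`. The README does not say how output files are named, so consecutive runs may overwrite each other.

```python
import subprocess
import sys


def build_command(lora_path, prompt, script="cli.py"):
    """Build the argument list for one cli.py invocation."""
    return [sys.executable, script,
            "--lora_path", str(lora_path),
            "--prompt", prompt]


def generate_all(lora_path, prompts, dry_run=False):
    """Run cli.py once per prompt; with dry_run=True, only return the commands."""
    commands = [build_command(lora_path, p) for p in prompts]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a run fails
            subprocess.run(cmd, check=True)
    return commands
```

Calling `generate_all("/path/to/lora", prompts, dry_run=True)` first is a cheap way to inspect the commands before committing GPU time.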
Acknowledgement
- finetrainers: We followed their implementation of Mochi training and inference.
- CogVideoX: We followed their implementation of CogVideoX training and inference.
We are grateful for their exceptional work and generous contribution to the open-source community.
Citation
@misc{wang2025transpixeler,
title={TransPixeler: Advancing Text-to-Video Generation with Transparency},
author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Ying-Cong Chen},
year={2025},
eprint={2501.03006},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.03006},
}