metadata

title: TransPixelerTest
app_file: app.py
sdk: gradio
sdk_version: 5.35.0

TransPixeler: Advancing Text-to-Video Generation with Transparency (CVPR2025)

Luozhou Wang*, Yijun Li**, Zhifei Chen, Jui-Hsien Wang, Zhifei Zhang, He Zhang, Zhe Lin, Ying-Cong Chen†

HKUST(GZ), HKUST, Adobe Research.

* Internship Project
** Project Lead
† Corresponding Author

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes.
We introduce TransPixeler, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixeler leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixeler preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data.
Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.

📰 News

[2025.04.28] We have introduced a new development branch wan that integrates the Wan2.1 video generation model to support joint generation tasks. This branch includes training code tailored for generating both RGB and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.
[2025.02.26] TransPixeler has been accepted to CVPR 2025! See you in Nashville!
[2025.01.19] We've renamed our project from TransPixar to TransPixeler!!
[2025.01.17] We’ve created a Discord group and a WeChat group! Everyone is welcome to join for discussions and collaborations.
[2025.01.14] Added new tasks to the repository's roadmap, including support for Hunyuan and LTX video models, and ComfyUI integration.
[2025.01.07] Released project page, arXiv paper, inference code, and Hugging Face demo.

🔥 New Branch for Joint Generation with Wan2.1

We have introduced a new development branch wan that integrates the Wan2.1 video generation model to support joint generation tasks.

In the wan branch, we have developed and released training code tailored for joint generation scenarios, enabling the simultaneous generation of RGB videos and associated modalities (e.g., segmentation maps, alpha masks) from a shared text prompt.

Key features of the wan branch:

Integration of Wan2.1: Leverages the capabilities of the Wan2.1 video generation model for enhanced performance.
Joint Generation Support: Facilitates the concurrent generation of RGB and paired modality videos.
Dataset Structure: Expects each sample to include:
- A primary video file (001.mp4) representing the RGB content.
- A paired secondary video file (001_seg.mp4) with a fixed _seg suffix, representing the associated modality.
- A caption text file (001.txt) with the same base name as the primary video.
Periodic Evaluation: Supports periodic video sampling during training by setting eval_every_step or eval_every_epoch in the configuration.
Customized Pipelines: Offers tailored training and inference pipelines designed specifically for joint generation tasks.
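The dataset layout listed above can be sanity-checked before training. The following is a minimal sketch, not part of the repository: it assumes all samples sit in one flat directory, and the function name `find_incomplete_samples` is hypothetical. Only the file naming (`001.mp4`, `001_seg.mp4`, `001.txt`) comes from the description above.

```python
from pathlib import Path

SEG_SUFFIX = "_seg"  # fixed suffix of the paired modality video, as described above


def find_incomplete_samples(dataset_dir):
    """Return {base_name: [missing file names]} for samples that lack a required file.

    Assumption: every sample lives directly in dataset_dir (flat layout).
    """
    root = Path(dataset_dir)
    problems = {}
    for video in sorted(root.glob("*.mp4")):
        if video.stem.endswith(SEG_SUFFIX):
            # Secondary video: make sure its primary RGB video exists.
            base = video.stem[: -len(SEG_SUFFIX)]
            if not (root / f"{base}.mp4").exists():
                problems.setdefault(base, []).append(f"{base}.mp4")
            continue
        # Primary video: it needs a paired modality video and a caption file.
        base = video.stem
        for required in (f"{base}{SEG_SUFFIX}.mp4", f"{base}.txt"):
            if not (root / required).exists():
                problems.setdefault(base, []).append(required)
    return problems
```

An empty result means every primary video has both its `_seg` companion and its caption; otherwise the mapping names the files to add.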

👉 To use the joint generation features, please check out the wan branch.

Installation
TransPixeler LoRA Weights
Training
Inference
Acknowledgement
Citation

Installation

# For the main branch
conda create -n TransPixeler python=3.10
conda activate TransPixeler
pip install -r requirements.txt

Note:
If you want to use the Wan2.1 model, please check out the wan branch first:

git checkout wan

TransPixeler LoRA Weights

Our pipeline is designed to support various video tasks, including Text-to-RGBA Video and Image-to-RGBA Video.

We provide the following pre-trained LoRA weights:

Task	Base Model	Frames	LoRA weights	Inference VRAM
T2V + RGBA	THUDM/CogVideoX-5B	49	link	~24GB

Training - RGB + Alpha Joint Generation

We have open-sourced the training code for Mochi on RGBA joint generation. Please refer to the Mochi README for details.

Inference - Gradio Demo

In addition to the Hugging Face online demo, users can also launch a local inference demo based on CogVideoX-5B by running the following command:

python app.py

Inference - Command Line Interface (CLI)

To generate RGBA videos, navigate to the corresponding directory for the video model and execute the following command:

python cli.py \
    --lora_path /path/to/lora \
    --prompt "..."
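To generate videos for several prompts in one go, the command above can be wrapped in a small script. This is only a sketch under stated assumptions: it uses just the two flags shown above (`--lora_path` and `--prompt`), expects to be run from the directory that contains `cli.py`, and the helper names `build_cli_command` and `generate_all` are hypothetical.

```python
import subprocess
import sys


def build_cli_command(lora_path, prompt):
    """Assemble the argument list for one cli.py run, using only the two flags shown above."""
    return [sys.executable, "cli.py", "--lora_path", str(lora_path), "--prompt", prompt]


def generate_all(lora_path, prompts, dry_run=False):
    """Run cli.py once per prompt; with dry_run=True, only return the commands."""
    commands = [build_cli_command(lora_path, p) for p in prompts]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError, stopping at the first failed run.
            subprocess.run(cmd, check=True)
    return commands
```

Passing the arguments as a list avoids shell quoting problems when a prompt contains spaces or quotes. Calling `generate_all(..., dry_run=True)` first shows the commands that would be executed.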

Acknowledgement

finetrainers: We followed their implementation of Mochi training and inference.
CogVideoX: We followed their implementation of CogVideoX training and inference.

We are grateful for their exceptional work and generous contribution to the open-source community.

Citation

@misc{wang2025transpixeler,
      title={TransPixeler: Advancing Text-to-Video Generation with Transparency}, 
      author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Ying-Cong Chen},
      year={2025},
      eprint={2501.03006},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.03006}, 
}
