<p align="center">
<img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/f867c49d9570b88e7bbce6e25583a0ad2417cdf7/icon.png" width="70"/>
</p>
# Pusa: Thousands Timesteps Video Diffusion Model
<p align="center">
<a href="https://yaofang-liu.github.io/Pusa_Web/"><img alt="Project Page" src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge"></a>
<a href="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/e99c3dcf866789a2db7fbe2686888ec398076a82/PusaV1/PusaV1.0_Report.pdf"><img alt="Technical Report" src="https://img.shields.io/badge/Technical_Report-π-B31B1B?style=for-the-badge"></a>
<a href="https://huggingface.co/RaphaelLiu/PusaV1"><img alt="Model" src="https://img.shields.io/badge/Pusa_V1.0-Model-FFD700?style=for-the-badge&logo=huggingface"></a>
<a href="https://huggingface.co/datasets/RaphaelLiu/PusaV1_training"><img alt="Dataset" src="https://img.shields.io/badge/Pusa_V1.0-Dataset-6495ED?style=for-the-badge&logo=huggingface"></a>
</p>
<p align="center">
<a href="https://github.com/Yaofang-Liu/Mochi-Full-Finetuner"><img alt="Code" src="https://img.shields.io/badge/Code-Training%20Scripts-32CD32?logo=github"></a>
<a href="https://arxiv.org/abs/2410.03160"><img alt="Paper" src="https://img.shields.io/badge/π-FVDM%20Paper-B31B1B?logo=arxiv"></a>
<a href="https://x.com/stephenajason"><img alt="Twitter" src="https://img.shields.io/badge/π¦-Twitter-1DA1F2?logo=twitter"></a>
<a href="https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748"><img alt="Xiaohongshu" src="https://img.shields.io/badge/π-Xiaohongshu-FF2442"></a>
</p>
## 🔥🔥🔥🎉 Announcing Pusa V1.0 🎉🔥🔥🔥
We are excited to release **Pusa V1.0**, a groundbreaking paradigm that leverages **vectorized timestep adaptation (VTA)** to enable fine-grained temporal control within a unified video diffusion framework. By finetuning the SOTA **Wan-T2V-14B** model with VTA, Pusa V1.0 achieves unprecedented efficiency, **surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000)** and **≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples)**. The codebase has been integrated into the `PusaV1` directory, based on `DiffSynth-Studio`.
<img width="1000" alt="Image" src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/d98ef44c1f7c11724a6887b71fe35152493c68b4/PusaV1/pusa_benchmark_figure_dark.png" />
Pusa V1.0 not only sets a new standard for image-to-video generation but also unlocks many other zero-shot multi-task capabilities such as start-end frame generation and video extension, all without task-specific training, while preserving the base model's T2V capabilities.
For detailed usage and examples for Pusa V1.0, please see the **[Pusa V1.0 README](./PusaV1/README.md)**.
## News
#### 🔥🔥🔥 2025.07: Pusa V1.0 (Pusa-Wan) code, technical report, and dataset all released! Check our [project page](https://yaofang-liu.github.io/Pusa_Web/) and [paper](https://github.com/Yaofang-Liu/Pusa-VidGen/blob/e99c3dcf866789a2db7fbe2686888ec398076a82/PusaV1/PusaV1.0_Report.pdf) for more info.
#### 🔥🔥🔥 2025.04: Pusa V0.5 (Pusa-Mochi) released.
<p align="center">
<img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/55de93a198427525e23a509e0f0d04616b10d71f/assets/demo0.gif" width="1000" autoplay loop muted/>
<br>
<em>Pusa V0.5 showcases its multi-task video generation capabilities</em>
</p>
<p align="center">
<img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/8d2af9cad78859361cb1bc6b8df56d06b2c2fbb8/assets/demo_T2V.gif" width="1000" autoplay loop muted/>
<br>
<em>Pusa V0.5 can still perform text-to-video generation like its base model, Mochi</em>
</p>
**Pusa can do much more; see the details below.**
## Table of Contents
- [Overview](#overview)
- [Changelog](#changelog)
- [Pusa V1.0 (Based on Wan)](#pusa-v10-based-on-wan)
- [Pusa V0.5 (Based on Mochi)](#pusa-v05-based-on-mochi)
- [Training](#training)
- [Limitations](#limitations)
- [Current Status and Roadmap](#current-status-and-roadmap)
- [Related Work](#related-work)
- [BibTeX](#bibtex)
## Overview
Pusa (*pu: 'sA:*, from "Thousand-Hand Guanyin" in Chinese) introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, departing from conventional scalar timestep approaches. This shift was first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper.
**Pusa V1.0** is based on the SOTA **Wan-T2V-14B** model and enhances it with our unique vectorized timestep adaptations (VTA), a non-destructive adaptation that fully preserves the capabilities of the base model.
**Pusa V0.5** uses the same architecture and is based on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.
The video below compares Pusa's frame-level noise architecture with vectorized timesteps against conventional video diffusion models, which use a single scalar timestep:
https://github.com/user-attachments/assets/7d751fd8-9a14-42e6-bcde-6db940df6537
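To make the distinction concrete, here is a minimal, illustrative sketch (not the actual Pusa implementation) that contrasts a single scalar timestep with a per-frame vectorized timestep under a simple linear-interpolation noising rule:

```python
# Illustrative only: contrast a scalar timestep with a vectorized (per-frame)
# timestep using the simple noising rule x_t = (1 - t) * x_0 + t * noise.
import torch

num_frames, channels, height, width = 8, 16, 4, 4
video = torch.randn(num_frames, channels, height, width)  # clean latent frames
noise = torch.randn_like(video)

# Conventional video diffusion: one scalar timestep shared by every frame.
t_scalar = torch.tensor(0.7)
noised_scalar = (1 - t_scalar) * video + t_scalar * noise

# Pusa/FVDM-style: a vector of timesteps, one entry per frame. For example,
# keep frame 0 almost clean (an image condition) while the rest stay noisy.
t_vector = torch.full((num_frames,), 0.7)
t_vector[0] = 0.05
t = t_vector.view(num_frames, 1, 1, 1)  # broadcast over channel/height/width
noised_vector = (1 - t) * video + t * noise
```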
### ✨ Key Features
- **Comprehensive Multi-task Support**:
- Text-to-Video
- Image-to-Video
- Start-End Frames
- Video completion/transitions
- Video Extension
- And more...
- **Unprecedented Efficiency**:
- Surpasses Wan-I2V-14B with **≤ 1/200 of the training cost** (\$500 vs. ≥ \$100,000)
- Trained on a dataset **≤ 1/2500 of the size** (4K vs. ≥ 10M samples)
- Achieves a **VBench-I2V score of 87.32%** (vs. 86.86% for Wan-I2V-14B)
- **Complete Open-Source Release**:
- Full codebase and training/inference scripts
- LoRA model weights and dataset for Pusa V1.0
- Detailed architecture specifications
- Comprehensive training methodology
### 🔍 Unique Architecture
- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability.
- **Non-destructive Modification**: Our adaptations to the base model preserve its original Text-to-Video generation capabilities; only light fine-tuning is required afterwards.
- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*
## Changelog
**v1.0 (July 15, 2025)**
- Released Pusa V1.0, based on the Wan-Video models.
- Released Technical Report, V1.0 model weights and dataset.
- Integrated codebase as `/PusaV1`.
- Added new examples and training scripts for Pusa V1.0 in `PusaV1/`.
- Updated documentation for the V1.0 release.
**v0.5 (June 3, 2025)**
- Released inference scripts for Start&End Frames Generation, Multi-Frames Generation, Video Transition, and Video Extension.
**v0.5 (April 10, 2025)**
- Released our training code and details [here](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner)
- Added multi-node/single-node full fine-tuning support for both Pusa and Mochi
- Released our [training dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training)
## Pusa V1.0 (Based on Wan)
Pusa V1.0 leverages the powerful Wan-Video models and enhances them with our custom LoRA models and training scripts. For detailed instructions on installation, model preparation, usage examples, and training, please refer to the **[Pusa V1.0 README](./PusaV1/README.md)**.
## Pusa V0.5 (Based on Mochi)
<details>
<summary>Click to expand for Pusa V0.5 details</summary>
### Installation
You can install the package using [uv](https://github.com/astral-sh/uv):
```bash
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
```
If you want to install flash attention, you can use:
```bash
uv pip install -e .[flash] --no-build-isolation
```
### Download Weights
**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```
**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
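If you prefer doing this from Python, here is a minimal sketch using `huggingface_hub.snapshot_download` (the local path is just an example):

```python
# Optional Python alternative to the CLI download above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="RaphaelLiu/Pusa-V0.5",
    local_dir="./weights/Pusa-V0.5",  # example path; choose your own
)
```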
## Usage
### Image-to-Video Generation
```bash
python ./demos/cli_test_ti2v_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "Your_prompt_here" \
--image_dir "/path/to/input/image.jpg" \
--cond_position 0 \
--num_steps 30 \
--noise_multiplier 0
```
Note: We suggest trying different `cond_position` values, and you can also adjust the level of noise added to the condition image via `--noise_multiplier`; you may get some pleasant surprises.
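For instance, a small hypothetical helper that sweeps a few `cond_position` and `noise_multiplier` combinations by invoking the demo script above (all paths and values are placeholders):

```python
# Hypothetical sweep over conditioning position and noise level.
import subprocess

for cond_position in (0, 1, 2):
    for noise_multiplier in (0.0, 0.2, 0.4):
        subprocess.run(
            [
                "python", "./demos/cli_test_ti2v_release.py",
                "--model_dir", "/path/to/Pusa-V0.5",
                "--dit_path", "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors",
                "--prompt", "Your_prompt_here",
                "--image_dir", "/path/to/input/image.jpg",
                "--cond_position", str(cond_position),
                "--num_steps", "30",
                "--noise_multiplier", str(noise_multiplier),
            ],
            check=True,
        )
```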
Take `./demos/example.jpg` as an example and run with 4 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./demos/cli_test_ti2v_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "The camera remains still, the man is surfing on a wave with his surfboard." \
--image_dir "./demos/example.jpg" \
--cond_position 0 \
--num_steps 30 \
--noise_multiplier 0.4
```
You can get this result:
<p align="center">
<img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/62526737953d9dc757414f2a368b94a0492ca6da/assets/example.gif" width="300" autoplay loop muted/>
<br>
</p>
For comparison, you can refer to the baseline results from the [VideoGen-Eval](https://github.com/AILab-CVC/VideoGen-Eval) benchmark:
<p align="center">
<img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/62526737953d9dc757414f2a368b94a0492ca6da/assets/example_baseline.gif" width="1000" autoplay loop muted/>
<br>
</p>
#### Processing A Group of Images
```bash
python ./demos/cli_test_ti2v_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--image_dir "/path/to/image/directory" \
--prompt_dir "/path/to/prompt/directory" \
--cond_position 1 \
--num_steps 30
```
For group processing, each image should have a corresponding text file with the same name in the prompt directory.
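As an illustration of this pairing, the sketch below checks that every image has a same-named prompt file (the directory paths and the `.jpg` extension are placeholders):

```python
# Verify the assumed image/prompt pairing: image_dir/clip_001.jpg should have
# a matching prompt_dir/clip_001.txt, and so on.
from pathlib import Path

image_dir = Path("/path/to/image/directory")
prompt_dir = Path("/path/to/prompt/directory")

for image_path in sorted(image_dir.glob("*.jpg")):
    prompt_path = prompt_dir / (image_path.stem + ".txt")
    if not prompt_path.exists():
        print(f"Missing prompt for {image_path.name}: expected {prompt_path}")
```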
#### Using the Provided Shell Script
We also provide a shell script for convenience:
```bash
# Edit cli_test_ti2v_release.sh to set your paths
# Then run:
bash ./demos/cli_test_ti2v_release.sh
```
### Multi-frame Condition
Pusa supports generating videos from multiple keyframes (2 or more) placed at specific positions in the sequence. This is useful for both start-end frame generation and multi-keyframe interpolation.
#### Start & End Frame Generation
```bash
python ./demos/cli_test_multi_frames_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "Drone view of waves crashing against the rugged cliffs along Big Surβs garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliffβs edge. The steep drop from the road down to the beach is a dramatic feat, with the cliffβs edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
--multi_cond '{"0": ["./demos/example3.jpg", 0.3], "20": ["./demos/example5.jpg", 0.7]}' \
--num_steps 30
```
The `multi_cond` parameter specifies frame condition positions together with their image paths and noise multipliers. In this example, the first frame (position 0) uses `./demos/example3.jpg` with noise multiplier 0.3, and frame 20 uses `./demos/example5.jpg` with noise multiplier 0.7.
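If you build the `--multi_cond` value programmatically, it is a JSON object that maps a frame position to an `[image_path, noise_multiplier]` pair, for example:

```python
# Construct the --multi_cond argument for the example above.
import json

multi_cond = {
    "0": ["./demos/example3.jpg", 0.3],
    "20": ["./demos/example5.jpg", 0.7],
}
print(json.dumps(multi_cond))  # pass this string to --multi_cond
```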
Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_multi_frames_release.sh first
bash ./demos/cli_test_multi_frames_release.sh
```
#### Multi-keyframe Interpolation
To generate videos with more than two keyframes (e.g., start, middle, and end):
```bash
python ./demos/cli_test_multi_frames_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "Drone view of waves crashing against the rugged cliffs along Big Surβs garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliffβs edge. The steep drop from the road down to the beach is a dramatic feat, with the cliffβs edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
--multi_cond '{"0": ["./demos/example3.jpg", 0.3], "13": ["./demos/example4.jpg", 0.7], "27": ["./demos/example5.jpg", 0.7]}' \
--num_steps 30
```
### Video Transition
Create smooth transitions between two videos:
```bash
python ./demos/cli_test_transition_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "A fluffy Cockapoo, perched atop a vibrant pink flamingo jumps into a crystal-clear pool." \
--video_start_dir "./demos/example1.mp4" \
--video_end_dir "./demos/example2.mp4" \
--cond_position_start "[0]" \
--cond_position_end "[-3,-2,-1]" \
--noise_multiplier "[0.3,0.8,0.8,0.8]" \
--num_steps 30
```
Parameters:
- `cond_position_start`: Frame indices from the start video to use as conditioning
- `cond_position_end`: Frame indices from the end video to use as conditioning
- `noise_multiplier`: Noise level multipliers for each conditioning frame (four values here: one start frame plus three end frames; see the sketch below)
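A rough sanity check for these arguments, based on the example above (illustrative only; the demo script is the authoritative reference):

```python
# Illustrative check: one noise multiplier per conditioning frame,
# start-video frames first, then end-video frames (assumed ordering).
cond_position_start = [0]          # frames taken from the start video
cond_position_end = [-3, -2, -1]   # frames taken from the end video
noise_multiplier = [0.3, 0.8, 0.8, 0.8]

assert len(noise_multiplier) == len(cond_position_start) + len(cond_position_end)
```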
Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_transition_release.sh first
bash ./demos/cli_test_transition_release.sh
```
### Video Extension
Extend existing videos with generated content:
```bash
python ./demos/cli_test_extension_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun, reflecting the playful scene." \
--video_dir "./demos/example1.mp4" \
--cond_position "[0,1,2,3]" \
--noise_multiplier "[0.1,0.2,0.3,0.4]" \
--num_steps 30
```
Parameters:
- `cond_position`: Frame indices from the input video to use as conditioning
- `noise_multiplier`: Noise level multipliers for each conditioning frame
Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_v2v_release.sh first
bash ./demos/cli_test_v2v_release.sh
```
### Text-to-Video Generation
```bash
python ./demos/cli_test_ti2v_release.py \
--model_dir "/path/to/Pusa-V0.5" \
--dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
--prompt "A man is playing basketball" \
--num_steps 30
```
</details>
## Training
For Pusa V1.0, please find the training details in the **[Pusa V1.0 README](./PusaV1/README.md#training)**.
For Pusa V0.5, you can find our training code and details [here](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner), which also supports training for the original Mochi model.
## Limitations
Pusa currently has several known limitations:
- Video generation quality is bounded by the base model (e.g., Wan-T2V-14B for V1.0); we anticipate significant quality improvements when applying our methodology to more advanced models.
- We welcome community contributions to enhance model performance and extend its capabilities.
### Currently Available
- ✅ Model weights for Pusa V1.0 and V0.5
- ✅ Inference code for Text-to-Video generation
- ✅ Inference code for Image-to-Video generation
- ✅ Inference scripts for start & end frames, multi-frames, video transition, and video extension
- ✅ Training code and details
- ✅ Model full fine-tuning guide (for Pusa V0.5)
- ✅ Training datasets
- ✅ Technical Report for Pusa V1.0
### TODO List
- Release more advanced versions with SOTA models
- Add more capabilities such as long video generation
- ...
## Related Work
- [FVDM](https://arxiv.org/abs/2410.03160): Introduces frame-level noise control with vectorized timesteps, the approach that inspired Pusa.
- [Wan-Video](https://github.com/modelscope/DiffSynth-Studio): The foundation model for Pusa V1.0.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): The foundation model for Pusa V0.5, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.
## BibTeX
If you use this work in your project, please cite the following references.
```
@misc{Liu2025pusa,
title={Pusa: Thousands Timesteps Video Diffusion Model},
author={Yaofang Liu and Rui Liu},
year={2025},
url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```
```
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```