<p align="center">
    <img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/f867c49d9570b88e7bbce6e25583a0ad2417cdf7/icon.png" width="70"/>
</p>


# Pusa: Thousands Timesteps Video Diffusion Model
<p align="center">
    <a href="https://yaofang-liu.github.io/Pusa_Web/"><img alt="Project Page" src="https://img.shields.io/badge/Project-Page-blue?style=for-the-badge"></a>
    <a href="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/e99c3dcf866789a2db7fbe2686888ec398076a82/PusaV1/PusaV1.0_Report.pdf"><img alt="Technical Report" src="https://img.shields.io/badge/Technical_Report-📜-B31B1B?style=for-the-badge"></a>
    <a href="https://huggingface.co/RaphaelLiu/PusaV1"><img alt="Model" src="https://img.shields.io/badge/Pusa_V1.0-Model-FFD700?style=for-the-badge&logo=huggingface"></a>
    <a href="https://huggingface.co/datasets/RaphaelLiu/PusaV1_training"><img alt="Dataset" src="https://img.shields.io/badge/Pusa_V1.0-Dataset-6495ED?style=for-the-badge&logo=huggingface"></a>
</p>

<p align="center">
    <a href="https://github.com/Yaofang-Liu/Mochi-Full-Finetuner"><img alt="Code" src="https://img.shields.io/badge/Code-Training%20Scripts-32CD32?logo=github"></a>
    <a href="https://arxiv.org/abs/2410.03160"><img alt="Paper" src="https://img.shields.io/badge/📜-FVDM%20Paper-B31B1B?logo=arxiv"></a>
    <a href="https://x.com/stephenajason"><img alt="Twitter" src="https://img.shields.io/badge/🐦-Twitter-1DA1F2?logo=twitter"></a>
    <a href="https://www.xiaohongshu.com/user/profile/5c6f928f0000000010015ca1?xsec_token=YBEf_x-s5bOBQIMJuNQvJ6H23Anwey1nnDgC9wiLyDHPU=&xsec_source=app_share&xhsshare=CopyLink&appuid=5c6f928f0000000010015ca1&apptime=1752622393&share_id=60f9a8041f974cb7ac5e3f0f161bf748"><img alt="Xiaohongshu" src="https://img.shields.io/badge/📕-Xiaohongshu-FF2442"></a>
</p>



## 🔥🔥🔥🚀 Announcing Pusa V1.0 🚀🔥🔥🔥

We are excited to release **Pusa V1.0**, a groundbreaking paradigm that leverages **vectorized timestep adaptation (VTA)** to enable fine-grained temporal control within a unified video diffusion framework. By fine-tuning the SOTA **Wan-T2V-14B** model with VTA, Pusa V1.0 achieves unprecedented efficiency: it **surpasses the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000)** and **≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples)**. The codebase has been integrated into the `PusaV1` directory, based on `DiffSynth-Studio`.

<img width="1000" alt="Image" src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/d98ef44c1f7c11724a6887b71fe35152493c68b4/PusaV1/pusa_benchmark_figure_dark.png" />

Pusa V1.0 not only sets a new standard for image-to-video generation but also unlocks many other zero-shot multi-task capabilities, such as start-end frame generation and video extension, all without task-specific training, while preserving the base model's T2V capabilities.

For detailed usage and examples for Pusa V1.0, please see the **[Pusa V1.0 README](./PusaV1/README.md)**.


## News 
#### 🔥🔥🔥 2025.07: Pusa V1.0 (Pusa-Wan) code, technical report, and dataset all released! Check our [project page](https://yaofang-liu.github.io/Pusa_Web/) and [paper](https://github.com/Yaofang-Liu/Pusa-VidGen/blob/e99c3dcf866789a2db7fbe2686888ec398076a82/PusaV1/PusaV1.0_Report.pdf) for more info.
#### 🔥🔥🔥 2025.04: Pusa V0.5 (Pusa-Mochi) released.

 

<p align="center">
    <img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/55de93a198427525e23a509e0f0d04616b10d71f/assets/demo0.gif" width="1000" autoplay loop muted/>
    <br>
    <em>Pusa V0.5 showcase</em>
</p>

<p align="center">
    <img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/8d2af9cad78859361cb1bc6b8df56d06b2c2fbb8/assets/demo_T2V.gif" width="1000" autoplay loop muted/>
    <br>
    <em>Pusa V0.5 retains text-to-video generation, like its base model Mochi</em>
</p>


**Pusa can do much more; check the details below.**

 

## Table of Contents
- [Overview](#overview)
- [Changelog](#changelog)
- [Pusa V1.0 (Based on Wan)](#pusa-v10-based-on-wan)
- [Pusa V0.5 (Based on Mochi)](#pusa-v05-based-on-mochi)
- [Training](#training)
- [Limitations](#limitations)
- [Current Status and Roadmap](#current-status-and-roadmap)
- [Related Work](#related-work)
- [BibTeX](#bibtex)

## Overview

Pusa (*pu: 'sA:*, from "Thousand-Hand Guanyin" in Chinese) introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, departing from conventional scalar timestep approaches. This shift was first presented in our [FVDM](https://arxiv.org/abs/2410.03160) paper. 

**Pusa V1.0** is based on the SOTA **Wan-T2V-14B** model and enhances it with vectorized timestep adaptation (VTA), a non-destructive adaptation that fully preserves the capabilities of the base model.

**Pusa V0.5** is an earlier realization of this architecture, built on [Mochi1-Preview](https://huggingface.co/genmo/mochi-1-preview). We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.


The video below compares Pusa's frame-level noise architecture with vectorized timesteps against conventional video diffusion models, which use a single scalar timestep:

https://github.com/user-attachments/assets/7d751fd8-9a14-42e6-bcde-6db940df6537


### ✨ Key Features

- **Comprehensive Multi-task Support**:
  - Text-to-Video 
  - Image-to-Video 
  - Start-End Frames
  - Video completion/transitions
  - Video Extension
  - And more...

- **Unprecedented Efficiency**:
  - Surpasses Wan-I2V-14B with **≤ 1/200 of the training cost** (\$500 vs. ≥ \$100,000)
  - Trained on a dataset **≤ 1/2500 the size** (4K vs. ≥ 10M samples)
  - Achieves a **VBench-I2V score of 87.32%** (vs. 86.86% for Wan-I2V-14B)

- **Complete Open-Source Release**:
  - Full codebase and training/inference scripts
  - LoRA model weights and dataset for Pusa V1.0
  - Detailed architecture specifications
  - Comprehensive training methodology

### πŸ” Unique Architecture

- **Novel Diffusion Paradigm**: Implements frame-level noise control with vectorized timesteps, originally introduced in the [FVDM paper](https://arxiv.org/abs/2410.03160), enabling unprecedented flexibility and scalability.

- **Non-destructive Modification**: Our adaptations preserve the base model's original Text-to-Video generation capabilities; only light fine-tuning is needed afterward.

- **Universal Applicability**: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. *Collaborations enthusiastically welcomed!*


## Changelog

**v1.0 (July 15, 2025)**
- Released Pusa V1.0, based on the Wan-Video models.
- Released Technical Report, V1.0 model weights and dataset.
- Integrated codebase as `/PusaV1`.
- Added new examples and training scripts for Pusa V1.0 in `PusaV1/`.
- Updated documentation for the V1.0 release.

**v0.5 (June 3, 2025)**
- Released inference scripts for Start&End Frames Generation, Multi-Frames Generation, Video Transition, and Video Extension.

**v0.5 (April 10, 2025)**
- Released our training code and details [here](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner)
- Added multi-node/single-node full fine-tuning support for both Pusa and Mochi
- Released our [training dataset](https://huggingface.co/datasets/RaphaelLiu/PusaV0.5_Training)

## Pusa V1.0 (Based on Wan)

Pusa V1.0 leverages the powerful Wan-Video models and enhances them with our custom LoRA models and training scripts. For detailed instructions on installation, model preparation, usage examples, and training, please refer to the **[Pusa V1.0 README](./PusaV1/README.md)**.

## Pusa V0.5 (Based on Mochi)

<details>
<summary>Click to expand for Pusa V0.5 details</summary>

### Installation 

You may install using [uv](https://github.com/astral-sh/uv):

```bash
git clone https://github.com/genmoai/models
cd models
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation
```

To also install Flash Attention, use:
```bash
uv pip install -e .[flash] --no-build-isolation
```

### Download Weights

**Option 1**: Use the Hugging Face CLI:
```bash
pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>
```

**Option 2**: Download directly from [Hugging Face](https://huggingface.co/RaphaelLiu/Pusa-V0.5) to your local machine.
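Whichever option you choose, a quick sanity check (assuming the weights land in the directory layout used by the commands below) is to confirm the DiT checkpoint is present:

```bash
# The inference commands below expect this file; adjust the path to your download location.
ls /path/to/Pusa-V0.5/pusa_v0_dit.safetensors
```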


### Usage

### Image-to-Video Generation

```bash
python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Your_prompt_here" \
  --image_dir "/path/to/input/image.jpg" \
  --cond_position 0 \
  --num_steps 30 \
  --noise_multiplier 0
```
Note: We suggest trying different `cond_position` values, and you may also vary the level of noise added to the condition image (see the sweep sketch below); you are likely to get some surprises.
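As a concrete way to explore this, here is a minimal sketch (not part of the repo) that sweeps `--cond_position` and `--noise_multiplier` over a few values, using the same flags documented above:

```bash
# Sweep conditioning position and condition-image noise level.
# Adjust the value ranges to taste; each run produces a separate video.
for pos in 0 1 2; do
  for noise in 0 0.2 0.4; do
    python ./demos/cli_test_ti2v_release.py \
      --model_dir "/path/to/Pusa-V0.5" \
      --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
      --prompt "Your_prompt_here" \
      --image_dir "/path/to/input/image.jpg" \
      --cond_position "$pos" \
      --num_steps 30 \
      --noise_multiplier "$noise"
  done
done
```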

Take `./demos/example.jpg` as an example and run with 4 GPUs:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "The camera remains still, the man is surfing on a wave with his surfboard." \
  --image_dir "./demos/example.jpg" \
  --cond_position 0 \
  --num_steps 30 \
  --noise_multiplier 0.4
```
You can get this result:

<p align="center">
    <img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/62526737953d9dc757414f2a368b94a0492ca6da/assets/example.gif" width="300" autoplay loop muted/>
</p>


You may refer to baseline results from the [VideoGen-Eval](https://github.com/AILab-CVC/VideoGen-Eval) benchmark for comparison:

<p align="center">
    <img src="https://github.com/Yaofang-Liu/Pusa-VidGen/blob/62526737953d9dc757414f2a368b94a0492ca6da/assets/example_baseline.gif" width="1000" autoplay loop muted/>
</p>


#### Processing A Group of Images
```bash
python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --image_dir "/path/to/image/directory" \
  --prompt_dir "/path/to/prompt/directory" \
  --cond_position 1 \
  --num_steps 30
```

For group processing, each image should have a corresponding text file with the same basename in the prompt directory, as sketched below.
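For example, a hypothetical input layout could be prepared like this (file names are illustrative; only the matching basenames matter):

```bash
# Build matching image/prompt directories: surfing.jpg pairs with surfing.txt.
mkdir -p inputs/images inputs/prompts
cp ./demos/example.jpg inputs/images/surfing.jpg
echo "The camera remains still, the man is surfing on a wave with his surfboard." > inputs/prompts/surfing.txt
# Then pass: --image_dir inputs/images --prompt_dir inputs/prompts
```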

#### Using the Provided Shell Script
We also provide a shell script for convenience:

```bash
# Edit cli_test_ti2v_release.sh to set your paths, then run:
bash ./demos/cli_test_ti2v_release.sh
```

### Multi-frame Condition

Pusa supports generating videos from multiple keyframes (2 or more) placed at specific positions in the sequence. This is useful for both start-end frame generation and multi-keyframe interpolation.

#### Start & End Frame Generation

```bash
python ./demos/cli_test_multi_frames_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
  --multi_cond '{"0": ["./demos/example3.jpg", 0.3], "20": ["./demos/example5.jpg", 0.7]}' \
  --num_steps 30
```

The `multi_cond` parameter maps frame condition positions to an image path and a noise multiplier. In this example, the first frame (position 0) uses `./demos/example3.jpg` with noise multiplier 0.3, and frame 20 uses `./demos/example5.jpg` with noise multiplier 0.7.
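Since the inline JSON can get hard to read, one convenience sketch (assuming the script tolerates whitespace inside the JSON string) is to build it in a shell variable first:

```bash
# Keys are frame positions; values are [image_path, noise_multiplier] pairs.
MULTI_COND='{
  "0":  ["./demos/example3.jpg", 0.3],
  "20": ["./demos/example5.jpg", 0.7]
}'
python ./demos/cli_test_multi_frames_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Your_prompt_here" \
  --multi_cond "$MULTI_COND" \
  --num_steps 30
```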

Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_multi_frames_release.sh first
bash ./demos/cli_test_multi_frames_release.sh
```

#### Multi-keyframe Interpolation

To generate videos with more than two keyframes (e.g., start, middle, and end):

```bash
python ./demos/cli_test_multi_frames_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Drone view of waves crashing against the rugged cliffs along Big Sur's garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff's edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff's edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
  --multi_cond '{"0": ["./demos/example3.jpg", 0.3], "13": ["./demos/example4.jpg", 0.7], "27": ["./demos/example5.jpg", 0.7]}' \
  --num_steps 30
```

### Video Transition

Create smooth transitions between two videos:

```bash
python ./demos/cli_test_transition_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A fluffy Cockapoo, perched atop a vibrant pink flamingo jumps into a crystal-clear pool." \
  --video_start_dir "./demos/example1.mp4" \
  --video_end_dir "./demos/example2.mp4" \
  --cond_position_start "[0]" \
  --cond_position_end "[-3,-2,-1]" \
  --noise_multiplier "[0.3,0.8,0.8,0.8]" \
  --num_steps 30
```

Parameters:
- `cond_position_start`: frame indices from the start video to use as conditioning
- `cond_position_end`: frame indices from the end video to use as conditioning
- `noise_multiplier`: one noise multiplier per conditioning frame (see the annotated reading below)
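Our reading of how these lists line up in the example above (an annotation, not repo code):

```bash
# One noise multiplier per conditioning frame, start-video frames first:
COND_START="[0]"              # 1 frame: the first frame of the start video
COND_END="[-3,-2,-1]"         # 3 frames: the last three frames of the end video
NOISE="[0.3,0.8,0.8,0.8]"     # 4 entries = 1 start frame + 3 end frames
```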

Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_transition_release.sh first
bash ./demos/cli_test_transition_release.sh
```

### Video Extension

Extend existing videos with generated content:

```bash
python ./demos/cli_test_extension_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun, reflecting the playful scene." \
  --video_dir "./demos/example1.mp4" \
  --cond_position "[0,1,2,3]" \
  --noise_multiplier "[0.1,0.2,0.3,0.4]" \
  --num_steps 30
```

Parameters:
- `cond_position`: frame indices from the input video to use as conditioning
- `noise_multiplier`: one noise multiplier per conditioning frame (annotated below)
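One plausible reading of the example above (an interpretation, not repo code): the ramped multipliers add progressively more noise to later conditioning frames, loosening adherence to the input video as the generated continuation begins.

```bash
COND="[0,1,2,3]"            # condition on the first four frames of the input video
NOISE="[0.1,0.2,0.3,0.4]"   # more noise on later frames => weaker conditioning there
```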

Alternatively, use the provided shell script:
```bash
# Edit parameters in cli_test_v2v_release.sh first
bash ./demos/cli_test_v2v_release.sh
```

### Text-to-Video Generation
```bash
python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A man is playing basketball" \
  --num_steps 30
```

</details>

## Training

For Pusa V1.0, please find the training details in the **[Pusa V1.0 README](./PusaV1/README.md#training)**.

For Pusa V0.5, you can find our training code and details [here](https://github.com/Yaofang-Liu/Mochi-Full-Finetuner), which also supports training for the original Mochi model.

## Limitations

Pusa currently has several known limitations:
- Video generation quality depends on the base model (e.g., Wan-T2V-14B for V1.0); we anticipate significant quality improvements when applying our methodology to more advanced models.
- We welcome community contributions to enhance model performance and extend its capabilities.

### Currently Available
- ✅ Model weights for Pusa V1.0 and V0.5
- ✅ Inference code for Text-to-Video generation
- ✅ Inference code for Image-to-Video generation
- ✅ Inference scripts for start & end frames, multi-frames, video transition, and video extension
- ✅ Training code and details
- ✅ Full fine-tuning guide (for Pusa V0.5)
- ✅ Training datasets
- ✅ Technical Report for Pusa V1.0

### TODO List
- 🔄 Release more advanced versions with SOTA models
- 🔄 More capabilities, such as long video generation
- 🔄 ....

## Related Work

- [FVDM](https://arxiv.org/abs/2410.03160): Introduces frame-level noise control with vectorized timesteps, the approach that inspired Pusa.
- [Wan-Video](https://github.com/modelscope/DiffSynth-Studio): The foundation model for Pusa V1.0.
- [Mochi](https://huggingface.co/genmo/mochi-1-preview): The foundation model for Pusa V0.5, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.

## BibTeX
If you use this work in your project, please cite the following references.
```
@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
```

```
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-Michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}
```