Pusa: Thousands Timesteps Video Diffusion Model

Project Page Technical Report Model Dataset

Code Paper Twitter Xiaohongshu

🔥🔥🔥🚀 Announcing Pusa V1.0 🚀🔥🔥🔥

We are excited to release Pusa V1.0, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. By finetuning the SOTA Wan-T2V-14B model with VTA, Pusa V1.0 achieves unprecedented efficiency, surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). The codebase has been integrated into the PusaV1 directory, based on DiffSynth-Studio.


Pusa V1.0 not only sets a new standard for image-to-video generation but also unlocks many other zero-shot multi-task capabilities such as start-end frames and video extension, all without task-specific training while preserving the base model's T2V capabilities.

For detailed usage and examples for Pusa V1.0, please see the Pusa V1.0 README.

News

🔥🔥🔥 2025.07: Pusa V1.0 (Pusa-Wan) Code, Technical Report, and Dataset, all released!!! Check our project page and paper for more info.

🔥🔥🔥 2025.04: Pusa V0.5 (Pusa-Mochi) released.


Pusa V0.5 showcases


Pusa V0.5 can still perform text-to-video generation like its base model, Mochi.

Pusa can do much more; see the details below.


Overview

Pusa (/puːˈsɑː/, from "Thousand-Hand Guanyin" in Chinese) introduces a paradigm shift in video diffusion modeling through frame-level noise control with vectorized timesteps, departing from conventional scalar-timestep approaches. This shift was first presented in our FVDM paper.

Pusa V1.0 is based on the SOTA Wan-T2V-14B model and enhances it with our unique vectorized timestep adaptations (VTA), a non-destructive adaptation that fully preserves the capabilities of the base model.

Pusa V0.5 leverages this architecture and is based on Mochi1-Preview. We are open-sourcing this work to foster community collaboration, enhance methodologies, and expand capabilities.

Pusa's novel frame-level noise architecture with vectorized timesteps compared with conventional video diffusion models with a scalar timestep

https://github.com/user-attachments/assets/7d751fd8-9a14-42e6-bcde-6db940df6537
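To make the paradigm concrete, here is a minimal, illustrative PyTorch sketch, not the actual Pusa/Wan implementation; the latent shape, timestep values, and the simple linear (flow-matching style) noising step are assumptions for illustration only:

import torch

# Hypothetical latent video shape: (frames, channels, height, width)
frames, channels, h, w = 21, 16, 60, 104
x0 = torch.randn(frames, channels, h, w)      # clean video latents
noise = torch.randn_like(x0)

# Conventional video diffusion: one scalar timestep shared by all frames
t_scalar = torch.full((frames,), 0.8)

# Vectorized timesteps: each frame gets its own noise level, e.g. keep
# frame 0 nearly clean so it can act as an image condition
t_vector = torch.full((frames,), 0.8)
t_vector[0] = 0.05

def noisy_latents(x0, noise, t):
    # per-frame linear interpolation between data and noise (illustrative)
    t = t.view(-1, 1, 1, 1)
    return (1 - t) * x0 + t * noise

x_uniform = noisy_latents(x0, noise, t_scalar)    # every frame equally corrupted
x_framewise = noisy_latents(x0, noise, t_vector)  # frame-level noise control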

✨ Key Features

  • Comprehensive Multi-task Support:

    • Text-to-Video
    • Image-to-Video
    • Start-End Frames
    • Video completion/transitions
    • Video Extension
    • And more...
  • Unprecedented Efficiency:

    • Surpasses Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000)
    • Trained on a dataset ≤ 1/2500 of the size (4K vs. ≥ 10M samples)
    • Achieves a VBench-I2V score of 87.32% (vs. 86.86% for Wan-I2V-14B)
  • Complete Open-Source Release:

    • Full codebase and training/inference scripts
    • LoRA model weights and dataset for Pusa V1.0
    • Detailed architecture specifications
    • Comprehensive training methodology

πŸ” Unique Architecture

  • Novel Diffusion Paradigm: Implements frame-level noise control with vectorized timesteps, originally introduced in the FVDM paper, enabling unprecedented flexibility and scalability.

  • Non-destructive Modification: Our adaptations to the base model preserve its original Text-to-Video generation capabilities; after this adaptation, only light fine-tuning is needed.

  • Universal Applicability: The methodology can be readily applied to other leading video diffusion models including Hunyuan Video, Wan2.1, and others. Collaborations enthusiastically welcomed!

Changelog

v1.0 (July 15, 2025)

  • Released Pusa V1.0, based on the Wan-Video models.
  • Released Technical Report, V1.0 model weights and dataset.
  • Integrated codebase as /PusaV1.
  • Added new examples and training scripts for Pusa V1.0 in PusaV1/.
  • Updated documentation for the V1.0 release.

v0.5 (June 3, 2025)

  • Released inference scripts for Start&End Frames Generation, Multi-Frames Generation, Video Transition, and Video Extension.

v0.5 (April 10, 2025)

  • Released our training code and details here
  • Added multi-node/single-node full fine-tuning support for both Pusa and Mochi
  • Released our training dataset

Pusa V1.0 (Based on Wan)

Pusa V1.0 leverages the powerful Wan-Video models and enhances them with our custom LoRA models and training scripts. For detailed instructions on installation, model preparation, usage examples, and training, please refer to the Pusa V1.0 README.

Pusa V0.5 (Based on Mochi)

Click to expand for Pusa V0.5 details

Installation

You may install using uv:

git clone https://github.com/genmoai/models
cd models 
pip install uv
uv venv .venv
source .venv/bin/activate
uv pip install setuptools
uv pip install -e . --no-build-isolation

If you want to install flash attention, you can use:

uv pip install -e .[flash] --no-build-isolation

Download Weights

Option 1: Use the Hugging Face CLI:

pip install huggingface_hub
huggingface-cli download RaphaelLiu/Pusa-V0.5 --local-dir <path_to_downloaded_directory>

Option 2: Download directly from Hugging Face to your local machine.
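If you prefer to do this from Python, the huggingface_hub library's snapshot_download can fetch the same repository (the local directory below is a placeholder):

from huggingface_hub import snapshot_download

# Download the Pusa-V0.5 repository to a local directory (placeholder path)
snapshot_download(repo_id="RaphaelLiu/Pusa-V0.5", local_dir="/path/to/Pusa-V0.5")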

Usage

Image-to-Video Generation

python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Your_prompt_here" \
  --image_dir "/path/to/input/image.jpg" \
  --cond_position 0 \
  --num_steps 30 \
  --noise_multiplier 0

Note: We suggest trying different cond_position values here, and you may also adjust the level of noise added to the condition image via noise_multiplier; you may well get some surprises.
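For example, here is a small Python sketch that sweeps both knobs by calling the demo script above (the paths and sweep values are illustrative):

import itertools
import subprocess

# Illustrative sweep over cond_position and noise_multiplier
for pos, noise in itertools.product([0, 1], [0.0, 0.2, 0.4]):
    subprocess.run([
        "python", "./demos/cli_test_ti2v_release.py",
        "--model_dir", "/path/to/Pusa-V0.5",
        "--dit_path", "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors",
        "--prompt", "Your_prompt_here",
        "--image_dir", "/path/to/input/image.jpg",
        "--cond_position", str(pos),
        "--num_steps", "30",
        "--noise_multiplier", str(noise),
    ], check=True)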

Take ./demos/example.jpg as an example and run with 4 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "The camera remains still, the man is surfing on a wave with his surfboard." \
  --image_dir "./demos/example.jpg" \
  --cond_position 0 \
  --num_steps 30 \
  --noise_multiplier 0.4

You can get this result:


You may refer to the baseline results from the VideoGen-Eval benchmark for comparison:


Processing A Group of Images

python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --image_dir "/path/to/image/directory" \
  --prompt_dir "/path/to/prompt/directory" \
  --cond_position 1 \
  --num_steps 30

For group processing, each image should have a corresponding text file with the same name in the prompt directory.
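A small helper sketch (with placeholder paths, assuming .jpg inputs) to check the pairing before launching a batch run:

from pathlib import Path

# Verify every image has a same-named .txt prompt file (placeholder paths)
image_dir = Path("/path/to/image/directory")
prompt_dir = Path("/path/to/prompt/directory")

for img in sorted(image_dir.glob("*.jpg")):
    prompt_file = prompt_dir / (img.stem + ".txt")
    if not prompt_file.exists():
        print(f"Missing prompt for {img.name}: expected {prompt_file}")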

Using the Provided Shell Script

We also provide a shell script for convenience:

# Edit cli_test_ti2v_release.sh to set your paths
# Then run:
bash ./demos/cli_test_ti2v_release.sh

Multi-frame Condition

Pusa supports generating videos from multiple keyframes (2 or more) placed at specific positions in the sequence. This is useful for both start-end frame generation and multi-keyframe interpolation.

Start & End Frame Generation

python ./demos/cli_test_multi_frames_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
  --multi_cond '{"0": ["./demos/example3.jpg", 0.3], "20": ["./demos/example5.jpg", 0.7]}' \
  --num_steps 30

The multi_cond parameter specifies frame condition positions and their corresponding image paths and noise multipliers. In this example, the first frame (position 0) uses ./demos/example3.jpg with noise multiplier 0.3, and frame 20 uses ./demos/example5.jpg with noise multiplier 0.7.
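If you build this argument programmatically, it is simply a JSON object mapping frame position to [image path, noise multiplier]; a short sketch:

import json

# Map frame position -> [conditioning image path, noise multiplier]
multi_cond = {
    "0": ["./demos/example3.jpg", 0.3],
    "20": ["./demos/example5.jpg", 0.7],
}
print(json.dumps(multi_cond))  # pass the printed string to --multi_cond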

Alternatively, use the provided shell script:

# Edit parameters in cli_test_multi_frames_release.sh first
bash ./demos/cli_test_multi_frames_release.sh

Multi-keyframe Interpolation

To generate videos with more than two keyframes (e.g., start, middle, and end):

python ./demos/cli_test_multi_frames_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "Drone view of waves crashing against the rugged cliffs along Big Sur’s garay point beach. The crashing blue waters create white-tipped waves, while the golden light of the setting sun illuminates the rocky shore. A small island with a lighthouse sits in the distance, and green shrubbery covers the cliff’s edge. The steep drop from the road down to the beach is a dramatic feat, with the cliff’s edges jutting out over the sea. This is a view that captures the raw beauty of the coast and the rugged landscape of the Pacific Coast Highway." \
  --multi_cond '{"0": ["./demos/example3.jpg", 0.3], "13": ["./demos/example4.jpg", 0.7], "27": ["./demos/example5.jpg", 0.7]}' \
  --num_steps 30

Video Transition

Create smooth transitions between two videos:

python ./demos/cli_test_transition_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A fluffy Cockapoo, perched atop a vibrant pink flamingo jumps into a crystal-clear pool." \
  --video_start_dir "./demos/example1.mp4" \
  --video_end_dir "./demos/example2.mp4" \
  --cond_position_start "[0]" \
  --cond_position_end "[-3,-2,-1]" \
  --noise_multiplier "[0.3,0.8,0.8,0.8]" \
  --num_steps 30

Parameters:

  • cond_position_start: Frame indices from the start video to use as conditioning
  • cond_position_end: Frame indices from the end video to use as conditioning
  • noise_multiplier: Noise level multipliers for each conditioning frame
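A quick sketch of how the lists line up in this example; the ordering convention (start-video frames first, then end-video frames) is our reading of the command above, not a documented guarantee:

# One noise multiplier per conditioning frame
cond_position_start = [0]                  # 1 frame from the start video
cond_position_end = [-3, -2, -1]           # 3 frames from the end video
noise_multiplier = [0.3, 0.8, 0.8, 0.8]    # 1 + 3 = 4 values
assert len(noise_multiplier) == len(cond_position_start) + len(cond_position_end)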

Alternatively, use the provided shell script:

# Edit parameters in cli_test_transition_release.sh first
bash ./demos/cli_test_transition_release.sh

Video Extension

Extend existing videos with generated content:

python ./demos/cli_test_extension_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A cinematic shot captures a fluffy Cockapoo, perched atop a vibrant pink flamingo float, in a sun-drenched Los Angeles swimming pool. The crystal-clear water sparkles under the bright California sun, reflecting the playful scene." \
  --video_dir "./demos/example1.mp4" \
  --cond_position "[0,1,2,3]" \
  --noise_multiplier "[0.1,0.2,0.3,0.4]" \
  --num_steps 30

Parameters:

  • cond_position: Frame indices from the input video to use as conditioning
  • noise_multiplier: Noise level multipliers for each conditioning frame

Alternatively, use the provided shell script:

# Edit parameters in cli_test_v2v_release.sh first
bash ./demos/cli_test_v2v_release.sh

Text-to-Video Generation

python ./demos/cli_test_ti2v_release.py \
  --model_dir "/path/to/Pusa-V0.5" \
  --dit_path "/path/to/Pusa-V0.5/pusa_v0_dit.safetensors" \
  --prompt "A man is playing basketball" \
  --num_steps 30

Training

For Pusa V1.0, please find the training details in the Pusa V1.0 README.

For Pusa V0.5, you can find our training code and details here, which also supports training for the original Mochi model.

Limitations

Pusa currently has several known limitations:

  • Video generation quality is dependent on the base model (e.g., Wan-T2V-14B for V1.0).
  • We anticipate significant quality improvements when applying our methodology to more advanced models.
  • We welcome community contributions to enhance model performance and extend its capabilities.

Currently Available

  • ✅ Model weights for Pusa V1.0 and V0.5
  • ✅ Inference code for Text-to-Video generation
  • ✅ Inference code for Image-to-Video generation
  • ✅ Inference scripts for start & end frames, multi-frames, video transition, video extension
  • ✅ Training code and details
  • ✅ Model full fine-tuning guide (for Pusa V0.5)
  • ✅ Training datasets
  • ✅ Technical Report for Pusa V1.0

TODO List

  • 🔄 Release more advanced versions with SOTA models
  • 🔄 More capabilities such as long video generation
  • 🔄 ...

Related Work

  • FVDM: Introduces the groundbreaking frame-level noise control with vectorized timestep approach that inspired Pusa.
  • Wan-Video: The foundation model for Pusa V1.0.
  • Mochi: The foundation model for Pusa V0.5, recognized as a leading open-source video generation system on the Artificial Analysis Leaderboard.

BibTeX

If you use this work in your project, please cite the following references.

@misc{Liu2025pusa,
  title={Pusa: Thousands Timesteps Video Diffusion Model},
  author={Yaofang Liu and Rui Liu},
  year={2025},
  url={https://github.com/Yaofang-Liu/Pusa-VidGen},
}
@article{liu2024redefining,
  title={Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach},
  author={Liu, Yaofang and Ren, Yumeng and Cun, Xiaodong and Artola, Aitor and Liu, Yang and Zeng, Tieyong and Chan, Raymond H and Morel, Jean-michel},
  journal={arXiv preprint arXiv:2410.03160},
  year={2024}
}