---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-to-video
tags:
- zen
- hanzo-ai
- video-generation
- text-to-video
- image-to-video
- wan2.2
- diffusion
base_model: Wan-AI/Wan2.2-TI2V-5B
---

# Zen Director

Video generation model based on Wan 2.2, specialized for text-to-video and image-to-video generation.

## Base Model

Built on **[Wan-AI/Wan2.2-TI2V-5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B)**, a text-to-image-to-video model with 5B parameters.

**Note:** This release is based on Wan 2.2. Wan 2.5 has been announced but is not yet open source; we will upgrade when it becomes available.

## Capabilities

- **Text-to-Video**: Generate videos from text descriptions
- **Image-to-Video**: Animate static images into videos
- **High Resolution**: Generates video at up to 1280x720 and 24 FPS
- **Efficient**: Compact 5B model suited to single-GPU inference

## Model Details

- **Architecture**: Diffusion Transformer (DiT)
- **Parameters**: 5B total
- **Base**: Wan 2.2 TI2V
- **Resolution**: Up to 1280x720
- **Frame Rate**: 24 FPS
- **Duration**: Up to 5 seconds

## Installation

```bash
pip install diffusers transformers accelerate torch
pip install av opencv-python pillow
```

## Usage

### Text-to-Video

```python
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
import torch

# Load the model in half precision to reduce memory use
pipe = DiffusionPipeline.from_pretrained(
    "zenlm/zen-director",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate video from text (120 frames at 24 FPS = 5 seconds)
prompt = "A serene sunset over a calm ocean with waves gently lapping at the shore"
video = pipe(prompt, num_frames=120, height=720, width=1280).frames[0]

# Save video
export_to_video(video, "output.mp4", fps=24)
```
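
The defaults above can be adjusted for quality and reproducibility. A minimal sketch using the standard diffusers sampler arguments (`num_inference_steps`, `guidance_scale`, `generator`); the step count and guidance value shown here are illustrative, not tuned recommendations:

```python
import torch

# Seed the generator for reproducible output
generator = torch.Generator(device="cuda").manual_seed(42)

video = pipe(
    prompt="A timelapse of storm clouds rolling over a mountain ridge",
    num_frames=120,
    height=720,
    width=1280,
    num_inference_steps=40,  # fewer steps is faster but lower fidelity
    guidance_scale=5.0,      # higher values follow the prompt more literally
    generator=generator,
).frames[0]
export_to_video(video, "storm.mp4", fps=24)
```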

### Image-to-Video

```python
from PIL import Image
from diffusers.utils import export_to_video

# Load the starting image (reuses the `pipe` loaded above)
image = Image.open("input.jpg").convert("RGB")

# Generate video conditioned on the image
video = pipe(
    prompt="Animate this image with gentle camera movement",
    image=image,
    num_frames=120
).frames[0]

export_to_video(video, "animated.mp4", fps=24)
```
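
If the source image does not match the output resolution, resizing it up front can help avoid unexpected cropping or stretching. A small sketch, assuming a 1280x720 target to match the resolution listed under Model Details:

```python
from PIL import Image

# Resize the conditioning image to the intended output resolution
image = Image.open("input.jpg").convert("RGB").resize((1280, 720))
```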

## Performance

- **Inference Speed**: ~2-3 seconds per frame on an A100
- **Memory**: Requires 24GB+ VRAM at full resolution
- **Quantization**: FP16 recommended for consumer GPUs (see the offloading sketch below)
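
For GPUs under the 24GB mark, the generic diffusers memory helpers can trade speed for headroom. A minimal sketch, assuming the standard offloading and VAE tiling APIs apply to this pipeline (verify on your own setup):

```python
# Keep submodules on the CPU and move them to the GPU only when needed
pipe.enable_model_cpu_offload()

# Decode video latents in tiles to cap VAE memory, if the VAE supports it
if hasattr(pipe, "vae") and hasattr(pipe.vae, "enable_tiling"):
    pipe.vae.enable_tiling()
```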

## Roadmap

- **v1.0** - Wan 2.2 TI2V-5B base (current)
- **v2.0** - Upgrade to Wan 2.5 when it becomes open source
- **Future** - Fine-tuning for specific styles and domains

## Limitations

- Requires a high-end GPU (24GB+ VRAM recommended)
- Video duration is limited to 5 seconds
- Best results come from detailed, specific prompts (see the example below)
- Some motion artifacts can appear in complex scenes
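
To illustrate the point about prompt detail, here is a vague and a detailed prompt for the same scene (both invented for this example):

```python
# Vague: leaves framing, lighting, and motion to chance
prompt_vague = "a city street at night"

# Detailed: specifies camera motion, lighting, subjects, and depth of field
prompt_detailed = (
    "A slow dolly shot down a rain-soaked city street at night, neon signs "
    "reflecting off puddles, pedestrians with umbrellas, shallow depth of field"
)
```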

## Citation

```bibtex
@misc{zen-director-2025,
  title={Zen Director: Video Generation with Wan 2.2},
  author={Hanzo AI},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/zenlm/zen-director}}
}

@article{wan2024,
  title={Wan 2.2: High-Quality Video Generation},
  author={Wan-AI Team},
  journal={arXiv preprint},
  year={2024}
}
```

## License

Apache 2.0

---

**Note**: Based on Wan 2.2. Will be upgraded to Wan 2.5 when it becomes open-source.