---
title: OmniAvatar
emoji: 🧑‍🎤
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.38.2
app_file: app.py
pinned: true
---
<div align="center">
<h1>OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation</h1>
[Qijun Gan](https://agnjason.github.io/) · [Ruizi Yang](https://github.com/ZiziAmy/) · [Jianke Zhu](https://person.zju.edu.cn/en/jkzhu) · [Shaofei Xue]() · [Steven Hoi](https://scholar.google.com/citations?user=JoLjflYAAAAJ)
Zhejiang University, Alibaba Group
<div align="center">
<a href="https://omni-avatar.github.io/"><img src="https://img.shields.io/badge/Project-OmniAvatar-blue.svg"></a>  
<a href="http://arxiv.org/abs/2506.18866"><img src="https://img.shields.io/badge/Arxiv-2506.18866-b31b1b.svg?logo=arXiv"></a>  
<a href="https://huggingface.co/OmniAvatar/OmniAvatar-14B"><img src="https://img.shields.io/badge/π€-OmniAvatar-red.svg"></a>
</div>
</div>

## 🔥 Latest News!!
* July 2, 2025: We released the model weights for Wan 1.3B!
* June 24, 2025: We released the inference code and model weights!
## Quickstart
### 🛠️ Installation
Clone the repo:
``` sh
git clone https://github.com/Omni-Avatar/OmniAvatar
cd OmniAvatar
```
Install dependencies:
``` sh
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
# Optionally install flash_attn to accelerate attention computation
pip install flash_attn
```
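A quick sanity check (a minimal sketch; the exact version strings depend on your driver and the wheels you installed) to confirm that the CUDA build of PyTorch is active:
``` sh
# Expected output for the cu124 wheel: a 2.4.0+cu124-style version string, True, and 12.4
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.version.cuda)"
```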
### 🧱 Model Download
| Models                | Download Link                                                          | Notes                                |
|-----------------------|------------------------------------------------------------------------|--------------------------------------|
| Wan2.1-T2V-14B        | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B)         | Base model for 14B                   |
| OmniAvatar model 14B  | 🤗 [Huggingface](https://huggingface.co/OmniAvatar/OmniAvatar-14B)     | Our LoRA and audio condition weights |
| Wan2.1-T2V-1.3B       | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B)        | Base model for 1.3B                  |
| OmniAvatar model 1.3B | 🤗 [Huggingface](https://huggingface.co/OmniAvatar/OmniAvatar-1.3B)    | Our LoRA and audio condition weights |
| Wav2Vec               | 🤗 [Huggingface](https://huggingface.co/facebook/wav2vec2-base-960h)   | Audio encoder                        |

Download models using huggingface-cli:
``` sh
mkdir pretrained_models
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./pretrained_models/Wan2.1-T2V-14B
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./pretrained_models/wav2vec2-base-960h
huggingface-cli download OmniAvatar/OmniAvatar-14B --local-dir ./pretrained_models/OmniAvatar-14B
```
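To run the 1.3B model instead, the analogous downloads are (a sketch based on the repository names in the table above; the Wav2Vec audio encoder is shared between both sizes):
``` sh
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./pretrained_models/Wan2.1-T2V-1.3B
huggingface-cli download OmniAvatar/OmniAvatar-1.3B --local-dir ./pretrained_models/OmniAvatar-1.3B
```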
#### File structure (Samples for 14B)
```shell
OmniAvatar
├── pretrained_models
│   ├── Wan2.1-T2V-14B
│   │   └── ...
│   ├── OmniAvatar-14B
│   │   ├── config.json
│   │   └── pytorch_model.pt
│   └── wav2vec2-base-960h
│       └── ...
```
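For the 1.3B model, the layout would presumably mirror this (a sketch, assuming the 1.3B checkpoints ship the same files):
```shell
OmniAvatar
├── pretrained_models
│   ├── Wan2.1-T2V-1.3B
│   │   └── ...
│   ├── OmniAvatar-1.3B
│   │   ├── config.json
│   │   └── pytorch_model.pt
│   └── wav2vec2-base-960h
│       └── ...
```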
### 🔑 Inference
``` sh
# 480p only for now
# 14B
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt
# 1.3B
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference_1.3B.yaml --input_file examples/infer_samples.txt
```
#### 💡 Tips
- You can control the character's behavior through the prompt in `examples/infer_samples.txt`; each line follows the format `[prompt]@@[img_path]@@[audio_path]` (see the example line at the end of these tips). **The recommended range for the prompt and audio cfg is [4-6]. You can increase the audio cfg to achieve more consistent lip-sync.**
- To control prompt guidance and audio guidance separately, set `audio_scale=3` (or another value) for the audio guidance; `guidance_scale` then only controls the prompt guidance.
- The recommended `num_steps` range is [20-50]; more steps bring higher quality at the cost of speed. For multi-GPU inference, simply set `sp_size=$GPU_NUM`. To use [TeaCache](https://github.com/ali-vilab/TeaCache), set `tea_cache_l1_thresh=0.14`; the recommended range is [0.05-0.15].
- To reduce GPU memory usage, set `use_fsdp=True` and `num_persistent_param_in_dit`. An example command is as follows:
```bash
torchrun --standalone --nproc_per_node=8 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt --hp=sp_size=8,max_tokens=30000,guidance_scale=4.5,overlap_frame=13,num_steps=25,use_fsdp=True,tea_cache_l1_thresh=0.14,num_persistent_param_in_dit=7000000000
```
The table below gives detailed numbers; the model is tested on an A800.
|`model_size`|`torch_dtype`|`GPU_NUM`|`use_fsdp`|`num_persistent_param_in_dit`|Speed|Required VRAM|
|-|-|-|-|-|-|-|
|14B|torch.bfloat16|1|False|None (unlimited)|16.0s/it|36G|
|14B|torch.bfloat16|1|False|7*10**9 (7B)|19.4s/it|21G|
|14B|torch.bfloat16|1|False|0|22.1s/it|8G|
|14B|torch.bfloat16|4|True|None (unlimited)|4.8s/it|14.3G|
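The same `--hp` override syntax also covers the per-sample guidance knobs from the tips above. For instance, raising the audio guidance independently of the prompt guidance on a single GPU could look like this (a sketch using values inside the recommended ranges):
```bash
torchrun --standalone --nproc_per_node=1 scripts/inference.py --config configs/inference.yaml --input_file examples/infer_samples.txt --hp=guidance_scale=4.5,audio_scale=5,num_steps=30
```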
We train the 14B model with `30000` tokens for `480p` videos. We found that using more tokens at inference time can also give good results; you can try `60000` or `80000`. The overlap `overlap_frame` can be set to `1` or `13`; `13` gives more coherent generation, but error propagation is more severe.
- Prompts are also very important. The recommended structure is `[Description of first frame]` - `[Description of human behavior]` - `[Description of background (optional)]`.
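For reference, a single line of `examples/infer_samples.txt` following the format and prompt structure above could look like this (a sketch; the prompt text and the image/audio paths are hypothetical placeholders):
```
A woman stands facing the camera - She speaks expressively and gestures with both hands - Bright studio background@@examples/images/woman.png@@examples/audios/speech.wav
```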
## 🧩 Community Works
We ❤️ contributions from the open-source community! If your work has improved OmniAvatar, please let us know, or e-mail [ganqijun@zju.edu.cn](mailto:ganqijun@zju.edu.cn) directly. We are happy to reference your project for everyone's convenience. **🥸 Have Fun!**
## 📖 Citation
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```
@misc{gan2025omniavatar,
      title={OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation},
      author={Qijun Gan and Ruizi Yang and Jianke Zhu and Shaofei Xue and Steven Hoi},
      year={2025},
      eprint={2506.18866},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.18866},
}
```
## Acknowledgments
Thanks to [Wan2.1](https://github.com/Wan-Video/Wan2.1), [FantasyTalking](https://github.com/Fantasy-AMAP/fantasy-talking) and [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated.