---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---
<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">
# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)
**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://github.com/haidog-yaqub/EzAudio.
[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/)
[arXiv](https://arxiv.org/abs/2409.10819)
[Demo](https://huggingface.co/spaces/OpenSound/EzAudio)
🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with lower computational demands.
🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
<!-- We want to thank Hugging Face Space and Gradio for providing incredible demo platform. -->
## Installation
Clone the repository:
```bash
git clone git@github.com:haidog-yaqub/EzAudio.git
```
Install the dependencies:
```bash
cd EzAudio
pip install -r requirements.txt
```
Download checkpoints (optional; the API can also download them automatically):
[https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)
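If you prefer to fetch the checkpoints programmatically, here is a minimal sketch using `huggingface_hub` (the `local_dir` below is just an example path):
```python
from huggingface_hub import snapshot_download

# download every file from the OpenSound/EzAudio model repo into a local folder
ckpt_dir = snapshot_download(repo_id="OpenSound/EzAudio", local_dir="ckpts")
print(f"checkpoints saved to {ckpt_dir}")
```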
## Usage
You can use the model with the following code:
```python
import torch
import soundfile as sf

from api.ezaudio import EzAudio

# load the pretrained model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio editing / inpainting: regenerate part of an existing recording to match the prompt
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
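The same API can be reused across prompts once the model is loaded; continuing from the snippet above, here is a short sketch that only repeats the calls already shown:
```python
# batch generation: reuse the loaded EzAudio model for several prompts
prompts = [
    "rain falling on a tin roof",
    "footsteps on gravel",
    "a church bell ringing in the distance",
]
for p in prompts:
    sr, audio = ezaudio.generate_audio(p)
    sf.write(f'{p}.wav', audio, sr)
```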
## Training
#### Autoencoder
Refer to the VAE training section of our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).
#### T2A Diffusion Model
Prepare your data (see example in `src/dataset/meta_example.csv`), then run:
```bash
cd src
accelerate launch train.py
```
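The exact metadata schema is defined by `src/dataset/meta_example.csv`; the column names below are placeholders for illustration only, so match your file to that example rather than to this sketch:
```python
import pandas as pd

# hypothetical metadata layout -- the real column names are in src/dataset/meta_example.csv
meta = pd.DataFrame([
    {"audio_path": "/data/audio/dog_bark_001.wav", "caption": "a dog barking in the distance"},
    {"audio_path": "/data/audio/train_horn_002.wav", "caption": "a train passing by, blowing its horn"},
])
meta.to_csv("src/dataset/meta_train.csv", index=False)
```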
## Todo
- [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic ckpts downloading
- [ ] Release checkpoints for stage1 and stage2 [WIP]
## Reference
If you find the code useful for your research, please consider citing:
```bibtex
@article{hai2024ezaudio,
title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
journal={arXiv preprint arXiv:2409.10819},
year={2024}
}
```
## Acknowledgement
Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).