---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: https://github.com/haidog-yaqub/EzAudio.
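For readers unfamiliar with CFG rescaling, here is a minimal sketch of the general technique applied at each sampling step, assuming batched latent predictions from the denoiser. The function name, default scale, and rescale weight are illustrative, not EzAudio's exact implementation:

```python
import torch

def cfg_with_rescale(cond_out, uncond_out, guidance_scale=5.0, rescale_weight=0.7):
    # Standard classifier-free guidance: extrapolate away from the
    # unconditional prediction.
    guided = uncond_out + guidance_scale * (cond_out - uncond_out)
    # Large guidance scales inflate the prediction's standard deviation,
    # which hurts fidelity; rescale it back toward the conditional
    # prediction's statistics (per batch element).
    dims = tuple(range(1, cond_out.ndim))
    std_cond = cond_out.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / std_guided)
    # Blend the rescaled and raw guided predictions with a fixed weight.
    return rescale_weight * rescaled + (1.0 - rescale_weight) * guided
```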

[![Official Page](https://img.shields.io/badge/Official%20Page-EzAudio-blue?logo=Github&style=flat-square)](https://haidog-yaqub.github.io/EzAudio-Page/)
[![arXiv](https://img.shields.io/badge/arXiv-2409.10819-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2409.10819)
[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue)](https://huggingface.co/spaces/OpenSound/EzAudio)

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with low computational demands.

🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

🎮 EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

## Installation

Clone the repository:
```bash
git clone https://github.com/haidog-yaqub/EzAudio.git
```
Install the dependencies:
```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional):
[https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)
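
If you prefer fetching the checkpoints programmatically, here is a minimal sketch using `huggingface_hub` (assumes the package is installed; the local directory is illustrative):

```python
from huggingface_hub import snapshot_download

# Download all files from the EzAudio model repo into a local folder.
ckpt_dir = snapshot_download(
    repo_id="OpenSound/EzAudio",
    local_dir="ckpts/ezaudio",  # illustrative path; pick any directory
)
print(f"Checkpoints saved to: {ckpt_dir}")
```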

## Usage

You can use the model with the following code:

```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio editing / inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
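
Continuing from the snippet above, here is a small sketch of batching several prompts with the same model instance (the prompt list and output names are illustrative; only the `generate_audio` call already shown is used):

```python
# Reuse the loaded `ezaudio` model and the `sf` (soundfile) import from above.
prompts = [
    "rain falling on a tin roof",
    "footsteps on gravel",
    "an old clock ticking in a quiet room",
]

for i, prompt in enumerate(prompts):
    sr, audio = ezaudio.generate_audio(prompt)
    sf.write(f"sample_{i:02d}.wav", audio, sr)
```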

## Training

#### Autoencoder
Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model
Prepare your data (see example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```
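
Before launching a run, it can help to sanity-check the metadata file. A hypothetical sketch using `pandas` (the column name below is illustrative; consult `src/dataset/meta_example.csv` for the actual schema):

```python
import os
import pandas as pd

# Load the training metadata and list its columns.
meta = pd.read_csv("src/dataset/meta_example.csv")
print(meta.columns.tolist(), f"({len(meta)} rows)")

# Hypothetical column name; adjust to match the real schema.
missing = [p for p in meta.get("audio_path", []) if not os.path.exists(p)]
print(f"{len(missing)} referenced audio files are missing")
```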

## Todo
- [x] Release Gradio demo along with checkpoints: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet demo along with checkpoints: [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve the API and support automatic checkpoint downloading
- [ ] Release checkpoints for stage 1 and stage 2 [WIP]

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement
Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-α](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).