Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Paper • arXiv:2604.10708 • Published
Unified Audio Understanding, Generation, and Editing (SIGGRAPH 2026)
Audio-Omni is the first end-to-end framework that unifies understanding, generation, and editing across general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model (Qwen2.5-Omni) for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis.
```bash
# Clone the GitHub repository
git clone https://github.com/ZeyueT/Audio-Omni.git
cd Audio-Omni

# Install dependencies
pip install -e .
conda install -c conda-forge ffmpeg libsndfile

# Download model from Hugging Face
huggingface-cli download HKUSTAudio/Audio-Omni --local-dir model/
```
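After the download finishes, a quick standard-library check can confirm that the files landed in `model/`. The expected file names are taken from the file listing later in this card; the helper itself is hypothetical and not part of the `audio_omni` package:

```python
from pathlib import Path

# Expected files, per the repository's file listing
# (hypothetical helper, not part of the audio_omni package).
EXPECTED = ["Audio-Omni.json", "model.ckpt", "synchformer_state_dict.pth"]

def missing_model_files(model_dir: str) -> list[str]:
    """Return the expected model files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_model_files("model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All model files present.")
```

Running this before loading the model avoids a partial-download failure midway through a ~21 GB checkpoint load.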
```python
from audio_omni import AudioOmni
import torchaudio

# Load model
model = AudioOmni("model/Audio-Omni.json", "model/model.ckpt")

# 1. Understanding
response = model.understand(
    "Describe the sounds in this audio.",
    audio="example.wav",
)
print(response)

# 2. Generation (Text-to-Audio)
audio = model.generate("T2A", prompt="A clock ticking.")
torchaudio.save("output.wav", audio, model.sample_rate)

# 3. Editing (Add a sound)
audio = model.edit("Add", "input.wav", desc="skateboarding")
torchaudio.save("output_add.wav", audio, model.sample_rate)
```
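The single-call edit above generalizes naturally to a batch of edits. The sketch below assumes only the `model.edit(op, path, desc=...)` signature shown earlier; it takes the edit function as an argument so it can be exercised without loading the checkpoint:

```python
# Hypothetical batch-editing helper; `edit_fn` stands in for model.edit
# so the sketch is runnable without the ~21 GB checkpoint.
def batch_edit(edit_fn, jobs):
    """Apply (op, input_path, description) edit jobs; return results in order."""
    results = []
    for op, path, desc in jobs:
        results.append(edit_fn(op, path, desc=desc))
    return results

jobs = [
    ("Add", "street.wav", "skateboarding"),
    ("Add", "street.wav", "passing traffic"),
]
# With the real model: results = batch_edit(model.edit, jobs),
# then torchaudio.save(...) each result as in the example above.
```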
The downloaded `model/` directory contains:

- `Audio-Omni.json`: model configuration
- `model.ckpt`: model checkpoint (~21 GB)
- `synchformer_state_dict.pth`: Synchformer checkpoint for video conditioning

```bash
# Launch interactive demo
python run_gradio.py \
    --model-config model/Audio-Omni.json \
    --ckpt-path model/model.ckpt \
    --server-port 7777
```
```bibtex
@article{tian2026audioomni,
  title={Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing},
  author={Tian, Zeyue and Yang, Binxin and Liu, Zhaoyang and Zhang, Jiexuan and Yuan, Ruibin and Yin, Hubery and Chen, Qifeng and Li, Chen and Lv, Jing and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2604.10708},
  year={2026}
}
```
License: CC-BY-NC-4.0 (non-commercial use only). Commercial use of the model weights requires explicit written authorization from the authors. For commercial licensing inquiries, contact ztianad@connect.ust.hk.