---
license: apache-2.0
pipeline_tag: any-to-any
---
# Ming-UniAudio
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
## Introduction
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that integrates semantic and acoustic features within an end-to-end model. On top of this tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio)
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
- 🔥 First universal free-form speech editing model for a variety of semantic and acoustic editing tasks, without requiring any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)
## 📌 Updates
* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks.
## Key Features
Compared with other audio-assisted LLMs, Ming-UniAudio features the following key optimizations:
- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (a toy sketch of this closed loop is shown after this feature list).
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks.
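To make the closed-loop idea above concrete, here is a minimal, self-contained PyTorch sketch of a continuous speech tokenizer round trip. It is purely illustrative: the toy encoder/decoder, layer sizes, and hop length are assumptions and do not reflect the actual MingTok-Audio architecture or API.

```python
# Illustrative toy sketch only -- NOT the real MingTok-Audio implementation.
import torch
import torch.nn as nn


class ToyContinuousSpeechTokenizer(nn.Module):
    """Toy continuous tokenizer: waveform -> continuous latent frames -> waveform."""

    def __init__(self, latent_dim: int = 64, hop: int = 320):
        super().__init__()
        # A strided conv acts as a downsampling encoder producing one latent per hop.
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=hop, stride=hop)
        # A transposed conv plays the role of a decoder back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=hop, stride=hop)

    def encode(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) -> latents: (batch, frames, latent_dim)
        return self.encoder(wav).transpose(1, 2)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, frames, latent_dim) -> wav: (batch, 1, samples)
        return self.decoder(latents.transpose(1, 2))


tokenizer = ToyContinuousSpeechTokenizer()
wav = torch.randn(1, 1, 16000)      # one second of fake 16 kHz audio
latents = tokenizer.encode(wav)     # continuous "tokens" an LLM backbone could consume
recon = tokenizer.decode(latents)   # the real model refines reconstruction with a diffusion head
print(latents.shape, recon.shape)   # torch.Size([1, 50, 64]) torch.Size([1, 1, 16000])
```

In the real system, the continuous latents are consumed by the LLM backbone and speech is reconstructed with the help of a diffusion head rather than a single transposed convolution.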
## Evaluation
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
### Speech Understanding
ASR performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|:-----|:------|:------------:|:--------:|:-----:|:------:|:--------:|:-------:|:--------:|
| Understanding ASR | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
### Speech Generation
Performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|:-----|:------|:--------------:|:-----------:|:--------------:|:-----------:|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
## Model & Benchmark Downloads
You can download our latest models and benchmark from both Hugging Face and ModelScope.
| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|:---------|:----------|:------------------:|:-------------------:|:------------:|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope:
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
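If you prefer Hugging Face, the same repository can be fetched with `huggingface_hub`; the snippet below is a minimal sketch using `snapshot_download` (the local directory is an arbitrary choice that mirrors the layout used in the examples below).

```python
# Minimal Hugging Face download sketch (install with: pip install huggingface_hub).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",  # mirrors the path used in Example Usage
)
```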
## Use Cases
Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/).
## Environment Preparation
### Installation with pip
```shell
pip install -r requirements.txt
```
### Installation with docker
You can also set up the environment by building a Docker image. First, clone this repository:
```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:
```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```
Finally, start the container with the current repository directory mounted:
```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```
You can then run the model through the Python interface. You may either download the Hugging Face model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.
## Example Usage
We provide a step-by-step running example:
Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory
Download our model following the `Model & Benchmark Downloads` section above, then link it into the repository:
```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```
Step 3 - Enter the code directory; you can refer to the following code to run the Ming-UniAudio model.
```shell
jupyter notebook cookbooks/demo.ipynb
```
We also provide a simple example of how to use this repo below. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
```python
import random
import warnings

import numpy as np
import torch
from loguru import logger
from transformers import AutoProcessor

from modeling_bailingmm import BailingMMNativeForConditionalGeneration


def seed_everything(seed=1895):
    """Fix all random seeds for reproducible generation."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")


class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        # The processor (chat template, audio front-end) is loaded from the current repo directory.
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size

    def speech_understanding(self, messages):
        """Speech-to-text understanding (e.g. ASR) from a chat-style message list."""
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)
        for k in inputs.keys():
            if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
        return output_text

    def speech_generation(
        self,
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav',
    ):
        """Zero-shot TTS: synthesize `text` in the voice of the prompt audio."""
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device,
        )
        return waveform

    def speech_edit(
        self,
        messages,
        output_wav_path='out.wav',
    ):
        """Instruction-guided free-form speech editing."""
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        # Append the encoded answer prefix and extend the attention mask accordingly.
        ans = torch.tensor([self.tokenizer.encode('')]).to(inputs['input_ids'].device)
        inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
        attention_mask = inputs['attention_mask']
        inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)

        for k in inputs.keys():
            if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path,
        )
        return edited_speech, edited_text


if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")

    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")

    # TTS
    # text: "Our vision is to build the digital infrastructure for the service industry of the
    # future and bring more small but beautiful changes to the world."
    # prompt_text: "I hereby advise everyone not to get whitening injections carelessly."
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
    )
```
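The `speech_edit` method defined above is not exercised in the `__main__` block. The sketch below shows how it might be called; it assumes the Ming-UniAudio-16B-A3B-Edit checkpoint and reuses the ASR-style message format, so the instruction text and audio path are placeholders rather than prescribed inputs.

```python
# Hypothetical speech-editing call (assumes the Ming-UniAudio-16B-A3B-Edit weights
# have been downloaded; the instruction and audio path below are placeholders).
edit_model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")
edit_messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please remove the background noise from this speech."},
            {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
        ],
    },
]
edited_speech, edited_text = edit_model.speech_edit(
    messages=edit_messages, output_wav_path="edited.wav"
)
logger.info(f"Edited transcript: {edited_text}")
```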
Note: We tested the examples on NVIDIA H800-80GB and H20-96G GPUs with CUDA 12.4.
## Citation
If you find our work helpful, please consider citing it.