---
license: apache-2.0
pipeline_tag: any-to-any
---

# Ming-UniAudio

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

## Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively integrates semantic and acoustic features within an end-to-end model. Based on this unified continuous audio tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.

- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio)
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)

## 📌 Updates

* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks.

## Key Features

Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key optimizations:

- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks.
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks.

## Evaluation

In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.

### Speech Understanding
ASR performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|:-----|:------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Understanding (ASR) | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
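The ASR columns report error rates, so lower is better. As a rough, non-authoritative sketch of how such scores are commonly computed (the exact metric choice and text normalization for each benchmark may differ), the open-source `jiwer` library can be used:

```python
# Illustrative only: word/character error rates with the jiwer library.
# Benchmark-specific normalization rules are not reproduced here.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate (English-style scoring)
cer = jiwer.cer(reference, hypothesis)  # character error rate (common for Mandarin/dialects)
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```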
### Speech Generation
Performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|:-----|:------|:---:|:---:|:---:|:---:|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
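SIM measures how closely the generated speech matches the prompt speaker's voice, typically as the cosine similarity between speaker embeddings. The sketch below only illustrates the idea using the open-source Resemblyzer encoder; the benchmark's official SIM is computed with its own speaker-verification model, so absolute values will not match the table.

```python
# Rough illustration of a speaker-similarity (SIM) score:
# cosine similarity between speaker embeddings of the prompt and the generated audio.
# Resemblyzer is used here only as a stand-in encoder; benchmark SIM scores
# come from a different speaker-verification model.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
prompt_embed = encoder.embed_utterance(preprocess_wav("data/wavs/10002287-00000094.wav"))
synth_embed = encoder.embed_utterance(preprocess_wav("out.wav"))

sim = np.dot(prompt_embed, synth_embed) / (
    np.linalg.norm(prompt_embed) * np.linalg.norm(synth_embed)
)
print(f"SIM: {sim:.3f}")
```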
## Model & Benchmark Downloads

You can download our latest models and benchmark from both Hugging Face and ModelScope.

| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|:---------|:----------|:------------------:|:-------------------:|:------------:|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
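If you prefer to script the Hugging Face download, a minimal sketch using `huggingface_hub` (with a repo id from the table above) is:

```python
# Minimal sketch: download a repo snapshot from Hugging Face.
# Swap repo_id for any model listed above; add repo_type="dataset"
# for the Ming-Freeform-Audio-Edit benchmark.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```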
If you're in mainland China, we strongly recommend downloading our models from 🤖 ModelScope:

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```

Note: This download process will take several minutes to several hours, depending on your network conditions.

## Use Cases

Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/).

## Environment Preparation

### Installation with pip

```shell
pip install -r requirements.txt
```

### Installation with docker

You can also initialize the environment by building a docker image. First, clone this repository:

```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```

Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:

```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```

Finally, start the container with the current repo directory mounted:

```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```

You can then run the model through the Python interface. You may download the Hugging Face model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.

## Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```

Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory

Download our model following `Model & Benchmark Downloads`, then:

```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```

Step 3 - Enter the code directory and run the Ming-UniAudio model, for example via the demo notebook:

```shell
jupyter notebook cookbooks/demo.ipynb
```

We also provide a simple usage example below. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
```python
import random
import warnings

import numpy as np
import torch
from loguru import logger
from transformers import AutoProcessor

from modeling_bailingmm import BailingMMNativeForConditionalGeneration


def seed_everything(seed=1895):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")


class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size

    def speech_understanding(self, messages):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        return output_text

    def speech_generation(
        self,
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav',
    ):
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device,
        )

        return waveform

    def speech_edit(
        self,
        messages,
        output_wav_path='out.wav',
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        ans = torch.tensor([self.tokenizer.encode('')]).to(inputs['input_ids'].device)
        inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
        attention_mask = inputs['attention_mask']
        inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path,
        )

        return edited_speech, edited_text


if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")

    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")

    # TTS
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
    )
```

Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4.

## Citation

If you find our work helpful, feel free to give us a cite.