---
license: apache-2.0
pipeline_tag: any-to-any
---

# Ming-UniAudio

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

## Introduction

Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. Its core is a unified continuous speech tokenizer that effectively integrates semantic and acoustic features within an end-to-end model. Based on this unified continuous audio tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon [Ming-Lite-Omni](https://github.com/inclusionAI/Ming). Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.

- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio)
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: [Ming-UniAudio](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: [Ming-UniAudio-Edit](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)
- 🔥 First benchmark for free-form speech editing: [Ming-Freeform-Audio-Edit-Benchmark](https://huggingface.co/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)

## 📌 Updates

* [2025.09.30] 🔥 We release [Ming-UniAudio](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/) with significant improvements across speech understanding, generation, and free-form editing tasks.

## Key Features

Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key optimizations:

- **Unified Continuous Speech Tokenizer**: Ming-UniAudio proposes [MingTok-Audio](https://github.com/inclusionAI/MingTok-Audio), a unified continuous speech tokenizer based on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks.
- **Unified Speech Language Model for Generation and Understanding**: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- **Instruction-Guided Free-Form Speech Editing**: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with [Ming-Freeform-Audio-Edit](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark), the first open-source evaluation set for such tasks.

## Evaluation

In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.

### Speech Understanding
ASR performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | aishell2-ios | LS-clean | Hunan | Minnan | Guangyue | Chuanyu | Shanghai |
|:-----|:------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Understanding (ASR) | Kimi-Audio | **2.56** | **1.28** | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 |
| | Qwen2.5 Omni | 2.75 | 1.80 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 |
| | Qwen2 Audio | 2.92 | 1.60 | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 |
| | Ming-UniAudio-16B-A3B (ours) | 2.84 | 1.62 | **9.80** | **16.50** | **5.51** | **5.46** | **14.65** |
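The ASR columns report error rates, so lower is better. As a rough, non-authoritative sketch of how such scores are commonly computed (the exact metric choice and text normalization for each benchmark may differ), the open-source `jiwer` library can be used:

```python
# Illustrative only: word/character error rates with the jiwer library.
# Benchmark-specific normalization rules are not reproduced here.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

wer = jiwer.wer(reference, hypothesis)  # word error rate (English-style scoring)
cer = jiwer.cer(reference, hypothesis)  # character error rate (common for Mandarin/dialects)
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```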
### Speech Generation
Performance comparison on various audio benchmark datasets. The best results are in bold.
| Task | Model | Seed-zh WER(%) | Seed-zh SIM | Seed-en WER(%) | Seed-en SIM |
|:-----|:------|:---:|:---:|:---:|:---:|
| Generation | Seed-TTS | 1.12 | **0.80** | 2.25 | **0.76** |
| | MiMo-Audio | 1.96 | - | 5.37 | - |
| | Qwen3-Omni-30B-A3B-Instruct | 1.07 | - | **1.39** | - |
| | Ming-Omni-Lite | 1.69 | 0.68 | 4.31 | 0.51 |
| | Ming-UniAudio-16B-A3B (ours) | **0.95** | 0.70 | 1.85 | 0.58 |
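SIM measures how closely the generated speech matches the prompt speaker's voice, typically as the cosine similarity between speaker embeddings. The sketch below only illustrates the idea using the open-source Resemblyzer encoder; the benchmark's official SIM is computed with its own speaker-verification model, so absolute values will not match the table.

```python
# Rough illustration of a speaker-similarity (SIM) score:
# cosine similarity between speaker embeddings of the prompt and the generated audio.
# Resemblyzer is used here only as a stand-in encoder; benchmark SIM scores
# come from a different speaker-verification model.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
prompt_embed = encoder.embed_utterance(preprocess_wav("data/wavs/10002287-00000094.wav"))
synth_embed = encoder.embed_utterance(preprocess_wav("out.wav"))

sim = np.dot(prompt_embed, synth_embed) / (
    np.linalg.norm(prompt_embed) * np.linalg.norm(synth_embed)
)
print(f"SIM: {sim:.3f}")
```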
## Model & Benchmark Downloads

You can download our latest models and benchmark from both Hugging Face and ModelScope.

| **Type** | **Model** | **Input modality** | **Output modality** | **Download** |
|:---------|:----------|:------------------:|:-------------------:|:------------:|
| Tokenizer | MingTok-Audio | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/MingTok-Audio)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/MingTok-Audio) |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B) |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-UniAudio-16B-A3B-Edit)<br>[🤖 ModelScope](https://modelscope.cn/models/inclusionAI/Ming-UniAudio-16B-A3B-Edit) |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | [🤗 HuggingFace](https://huggingface.co/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[🤖 ModelScope](https://modelscope.cn/datasets/inclusionAI/Ming-Freeform-Audio-Edit-Benchmark)<br>[Eval tools](https://github.com/inclusionAI/Ming-Freeform-Audio-Edit) |
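If you prefer to script the Hugging Face download, a minimal sketch using `huggingface_hub` (with a repo id from the table above) is:

```python
# Minimal sketch: download a repo snapshot from Hugging Face.
# Swap repo_id for any model listed above; add repo_type="dataset"
# for the Ming-Freeform-Audio-Edit benchmark.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B",
)
```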
If you're in mainland China, we strongly recommend downloading our models from 🤖 ModelScope:

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B --local_dir inclusionAI/Ming-UniAudio-16B-A3B --revision master
```

Note: This download process will take several minutes to several hours, depending on your network conditions.

## Use Cases

Additional demonstration cases are available on our project [page](https://xqacmer.github.io/Ming-Unitok-Audio.github.io/).

## Environment Preparation

### Installation with pip

```shell
pip install -r requirements.txt
```

### Installation with docker

You can also initialize the environment by building a docker image. First, clone this repository:

```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```

Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:

```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```

Finally, start the container with the current repo directory mounted:

```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```

You can then run the model through the Python interface. You may download the Hugging Face model into the repo directory first (`.../Ming-UniAudio/`) or mount the downloaded model path when starting the container.

## Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code

```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```

Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory

Download our model following `Model & Benchmark Downloads`, then:

```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B inclusionAI/Ming-UniAudio-16B-A3B
```

Step 3 - Enter the code directory and run the Ming-UniAudio model, for example via the demo notebook:

```shell
jupyter notebook cookbooks/demo.ipynb
```

We also provide a simple usage example below. For detailed usage, please refer to [demo.ipynb](https://github.com/inclusionAI/Ming-UniAudio/blob/main/cookbooks/demo.ipynb).
```python
import random
import warnings

import numpy as np
import torch
from loguru import logger
from transformers import AutoProcessor

from modeling_bailingmm import BailingMMNativeForConditionalGeneration


def seed_everything(seed=1895):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")


class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size

    def speech_understanding(self, messages):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        return output_text

    def speech_generation(
        self,
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav',
    ):
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device,
        )

        return waveform

    def speech_edit(
        self,
        messages,
        output_wav_path='out.wav',
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        ans = torch.tensor([self.tokenizer.encode('')]).to(inputs['input_ids'].device)
        inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
        attention_mask = inputs['attention_mask']
        inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path,
        )

        return edited_speech, edited_text


if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B")

    # ASR
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {
                    "type": "text",
                    "text": "Please recognize the language of this speech and transcribe it. Format: oral.",
                },
                {"type": "audio", "audio": "data/wavs/BAC009S0915W0292.wav"},
            ],
        },
    ]
    response = model.speech_understanding(messages=messages)
    logger.info(f"Generated Response: {response}")

    # TTS
    model.speech_generation(
        text='我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。',
        prompt_wav_path='data/wavs/10002287-00000094.wav',
        prompt_text='在此奉劝大家别乱打美白针。',
    )
```

Note: We tested the examples on NVIDIA H800-80GB/H20-96G hardware with CUDA 12.4.

## Citation

If you find our work helpful, feel free to give us a cite.