WenetSpeech-Yue
A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Longhao Li¹, Zhao Guo¹, Hongjie Chen², Yuhang Dai¹, Ziyu Zhang¹, Hongfei Xue¹, Tianlun Zuo¹, Chengyou Wang¹, Shuiyuan Wang¹, Xin Xu³, Hui Bu³, Jie Li², Jian Kang², Binbin Zhang⁴, Ruibin Yuan⁵, Ziya Zhou⁵, Wei Xue⁵, Lei Xie¹
¹ ASLP, Northwestern Polytechnical University, ² Institute of Artificial Intelligence (TeleAI), China Telecom, ³ Beijing AISHELL Technology Co., Ltd., ⁴ WeNet Open Source Community, ⁵ Hong Kong University of Science and Technology
📑 Paper | 🐙 GitHub | 🤗 HuggingFace | 🖥️ HuggingFace Space | 🎤 Demo Page | 💬 Contact Us
| Model | #Params (M) | In-House<br>Dialogue | In-House<br>Reading | Open-Source<br>yue | Open-Source<br>HK | Open-Source<br>MDCC | Open-Source<br>Daily_Use | Open-Source<br>Commands | WSYue-eval<br>Short | WSYue-eval<br>Long |
|---|---|---|---|---|---|---|---|---|---|---|
| **w/o LLM** | | | | | | | | | | |
| Conformer-Yue⭐ | 130 | 16.57 | 7.82 | 7.72 | 11.42 | 5.73 | 5.73 | 8.97 | 5.05 | 8.89 |
| Paraformer | 220 | 83.22 | 51.97 | 70.16 | 68.49 | 47.67 | 79.31 | 69.32 | 73.64 | 89.00 |
| SenseVoice-small | 234 | 21.08 | 6.52 | 8.05 | 7.34 | 6.34 | 5.74 | 6.65 | 6.69 | 9.95 |
| SenseVoice-s-Yue⭐ | 234 | 19.19 | 6.71 | 6.87 | 8.68 | 5.43 | 5.24 | 6.93 | 5.23 | 8.63 |
| Dolphin-small | 372 | 59.20 | 7.38 | 39.69 | 51.29 | 26.39 | 7.21 | 9.68 | 32.32 | 58.20 |
| TeleASR | 700 | 37.18 | 7.27 | 7.02 | 7.88 | 6.25 | 8.02 | 5.98 | 6.23 | 11.33 |
| Whisper-medium | 769 | 75.50 | 68.69 | 59.44 | 62.50 | 62.31 | 64.41 | 80.41 | 80.82 | 50.96 |
| Whisper-m-Yue⭐ | 769 | 18.69 | 6.86 | 6.86 | 11.03 | 5.49 | 4.70 | 8.51 | 5.05 | 8.05 |
| FireRedASR-AED-L | 1100 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 55.37 | 50.26 |
| Whisper-large-v3 | 1550 | 45.09 | 15.46 | 12.85 | 16.36 | 14.63 | 17.84 | 20.70 | 12.95 | 26.86 |
| **w/ LLM** | | | | | | | | | | |
| Qwen2.5-Omni-3B | 3000 | 72.01 | 7.49 | 12.59 | 11.75 | 38.91 | 10.59 | 25.78 | 67.95 | 88.46 |
| Kimi-Audio | 7000 | 68.65 | 24.34 | 40.90 | 38.72 | 30.72 | 44.29 | 45.54 | 50.86 | 33.49 |
| FireRedASR-LLM-L | 8300 | 73.70 | 18.72 | 43.93 | 43.33 | 34.53 | 48.05 | 49.99 | 49.87 | 45.92 |
| Conformer-LLM-Yue⭐ | 4200 | 17.22 | 6.21 | 6.23 | 9.52 | 4.35 | 4.57 | 6.98 | 4.73 | 7.91 |
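The scores above are error rates (lower is better) on each test set. For reference, a character-level error rate of the kind typically reported for Cantonese ASR can be sketched with a plain Levenshtein edit distance; the function names below are illustrative and not part of the WenetSpeech-Yue toolchain:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

# One substituted character out of five -> 20.0
print(round(100 * cer("今日天氣好", "今日天气好"), 2))
```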
Decoding with Conformer-Yue (WeNet):

```shell
# Paths to the released checkpoint and your test data
dir=u2pp_conformer_yue
decode_checkpoint=$dir/u2pp_conformer_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention_rescoring \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --ctc_weight 0.5 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1  # -1 = full-context (non-streaming) decoding
```
Decoding with Whisper-m-Yue (WeNet):

```shell
# Paths to the released checkpoint and your test data
dir=whisper_medium_yue
decode_checkpoint=$dir/whisper_medium_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir

python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --blank_penalty 0.0 \
  --ctc_weight 0.0 \
  --reverse_weight 0.0 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1  # -1 = full-context (non-streaming) decoding
```
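Both `recognize.py` invocations read `$test_set/data.list`. In WeNet's raw data format this is one JSON object per line with `key`, `wav`, and `txt` fields; a minimal sketch of generating it (utterance IDs, wav paths, and transcripts below are placeholders):

```python
import json

# (utterance id, wav path, reference transcript) triples -- placeholders
utts = [
    ("utt001", "wavs/utt001.wav", "今日天氣好"),
    ("utt002", "wavs/utt002.wav", "你食咗飯未"),
]

with open("data.list", "w", encoding="utf-8") as f:
    for key, wav, txt in utts:
        f.write(json.dumps({"key": key, "wav": wav, "txt": txt},
                           ensure_ascii=False) + "\n")
```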
Inference with SenseVoice-s-Yue (FunASR):

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "sensevoice_small_yue"
wav_path = "path/to/audio.wav"

model = AutoModel(
    model=model_dir,
    device="cuda:0",
)
res = model.generate(
    input=wav_path,
    cache={},
    language="yue",  # decode as Cantonese
    use_itn=True,    # apply inverse text normalization
    batch_size=64,
)
# Strip SenseVoice's inline control tokens from the raw hypothesis
text = rich_transcription_postprocess(res[0]["text"])
print(text)
```
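SenseVoice's raw hypotheses carry inline `<|...|>` control tokens (language, emotion, audio-event tags), which `rich_transcription_postprocess` resolves. A simplified, illustrative tag-stripper, not FunASR's actual implementation, shows the idea:

```python
import re

def strip_sensevoice_tags(text: str) -> str:
    """Remove <|...|> control tokens from a raw SenseVoice hypothesis."""
    return re.sub(r"<\|[^|]*\|>", "", text).strip()

raw = "<|yue|><|NEUTRAL|><|Speech|>今日天氣好"
print(strip_sensevoice_tags(raw))  # → 今日天氣好
```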