---
license: apache-2.0
---
# WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
<div>
<img width="800px" src="https://github.com/ASLP-lab/WenetSpeech-Yue/raw/main/figs/wenetspeech_yue.svg" />
</div>
## Project Tree
The structure of **WSYue-ASR** is organized as follows:
```
WSYue-ASR
├── sensevoice_small_yue/
│   ├── config.yaml
│   ├── configuration.json
│   └── model.pt
│
├── u2pp_conformer_yue/
│   ├── bpe.model
│   ├── lang_char.txt
│   ├── train.yaml
│   └── u2pp_conformer_yue.pt
│
├── whisper_medium_yue/
│   ├── train.yaml
│   └── whisper_medium_yue.pt
│
├── .gitattributes
└── README.md
```
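Before running inference, it can help to confirm that a local copy matches this layout. The sketch below is a minimal check, assuming the repository has already been downloaded to a local directory. The helper name `missing_files` and the `EXPECTED` mapping are illustrative, and the Whisper checkpoint name follows the path used in the inference command (`whisper_medium_yue.pt`).

```python
from pathlib import Path

# Expected files per model directory, taken from the project tree.
# The Whisper checkpoint name follows the inference command in this README.
EXPECTED = {
    "sensevoice_small_yue": ["config.yaml", "configuration.json", "model.pt"],
    "u2pp_conformer_yue": ["bpe.model", "lang_char.txt", "train.yaml", "u2pp_conformer_yue.pt"],
    "whisper_medium_yue": ["train.yaml", "whisper_medium_yue.pt"],
}


def missing_files(root):
    """Return the relative paths of expected files that are absent under root."""
    root = Path(root)
    return [
        f"{model_dir}/{name}"
        for model_dir, names in EXPECTED.items()
        for name in names
        if not (root / model_dir / name).is_file()
    ]


if __name__ == "__main__":
    # "WSYue-ASR" is a placeholder for wherever the repository was downloaded.
    for path in missing_files("WSYue-ASR"):
        print("missing:", path)
```

An empty result means every file listed in the tree is present; it does not verify that the checkpoints themselves are intact.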
## ASR Leaderboard
<table border="0" cellspacing="0" cellpadding="6" style="border-collapse:collapse;">
<tr>
<th align="left" rowspan="2">Model</th>
<th align="center" rowspan="2">#Params (M)</th>
<th align="center" colspan="2">In-House</th>
<th align="center" colspan="5">Open-Source</th>
<th align="center" colspan="2">WSYue-eval</th>
</tr>
<tr>
<th align="center">Dialogue</th>
<th align="center">Reading</th>
<th align="center">yue</th>
<th align="center">HK</th>
<th align="center">MDCC</th>
<th align="center">Daily_Use</th>
<th align="center">Commands</th>
<th align="center">Short</th>
<th align="center">Long</th>
</tr>
<tr><td align="left" colspan="11"><b>w/o LLM</b></td></tr>
<tr>
<td align="left"><b>Conformer-Yue</b></td><td align="center">130</td><td align="center"><b>16.57</b></td><td align="center">7.82</td><td align="center">7.72</td><td align="center">11.42</td><td align="center">5.73</td><td align="center">5.73</td><td align="center">8.97</td><td align="center"><ins>5.05</ins></td><td align="center">8.89</td>
</tr>
<tr>
<td align="left">Paraformer</td><td align="center">220</td><td align="center">83.22</td><td align="center">51.97</td><td align="center">70.16</td><td align="center">68.49</td><td align="center">47.67</td><td align="center">79.31</td><td align="center">69.32</td><td align="center">73.64</td><td align="center">89.00</td>
</tr>
<tr>
<td align="left">SenseVoice-small</td><td align="center">234</td><td align="center">21.08</td><td align="center"><ins>6.52</ins></td><td align="center">8.05</td><td align="center"><b>7.34</b></td><td align="center">6.34</td><td align="center">5.74</td><td align="center"><ins>6.65</ins></td><td align="center">6.69</td><td align="center">9.95</td>
</tr>
<tr>
<td align="left"><b>SenseVoice-s-Yue</b></td><td align="center">234</td><td align="center">19.19</td><td align="center">6.71</td><td align="center">6.87</td><td align="center">8.68</td><td align="center"><ins>5.43</ins></td><td align="center">5.24</td><td align="center">6.93</td><td align="center">5.23</td><td align="center">8.63</td>
</tr>
<tr>
<td align="left">Dolphin-small</td><td align="center">372</td><td align="center">59.20</td><td align="center">7.38</td><td align="center">39.69</td><td align="center">51.29</td><td align="center">26.39</td><td align="center">7.21</td><td align="center">9.68</td><td align="center">32.32</td><td align="center">58.20</td>
</tr>
<tr>
<td align="left">TeleASR</td><td align="center">700</td><td align="center">37.18</td><td align="center">7.27</td><td align="center">7.02</td><td align="center"><ins>7.88</ins></td><td align="center">6.25</td><td align="center">8.02</td><td align="center"><b>5.98</b></td><td align="center">6.23</td><td align="center">11.33</td>
</tr>
<tr>
<td align="left">Whisper-medium</td><td align="center">769</td><td align="center">75.50</td><td align="center">68.69</td><td align="center">59.44</td><td align="center">62.50</td><td align="center">62.31</td><td align="center">64.41</td><td align="center">80.41</td><td align="center">80.82</td><td align="center">50.96</td>
</tr>
<tr>
<td align="left"><b>Whisper-m-Yue</b></td><td align="center">769</td><td align="center">18.69</td><td align="center">6.86</td><td align="center"><ins>6.86</ins></td><td align="center">11.03</td><td align="center">5.49</td><td align="center"><ins>4.70</ins></td><td align="center">8.51</td><td align="center"><ins>5.05</ins></td><td align="center"><ins>8.05</ins></td>
</tr>
<tr>
<td align="left">FireRedASR-AED-L</td><td align="center">1100</td><td align="center">73.70</td><td align="center">18.72</td><td align="center">43.93</td><td align="center">43.33</td><td align="center">34.53</td><td align="center">48.05</td><td align="center">49.99</td><td align="center">55.37</td><td align="center">50.26</td>
</tr>
<tr>
<td align="left">Whisper-large-v3</td><td align="center">1550</td><td align="center">45.09</td><td align="center">15.46</td><td align="center">12.85</td><td align="center">16.36</td><td align="center">14.63</td><td align="center">17.84</td><td align="center">20.70</td><td align="center">12.95</td><td align="center">26.86</td>
</tr>
<tr><td align="left" colspan="11"><b>w/ LLM</b></td></tr>
<tr>
<td align="left">Qwen2.5-Omni-3B</td><td align="center">3000</td><td align="center">72.01</td><td align="center">7.49</td><td align="center">12.59</td><td align="center">11.75</td><td align="center">38.91</td><td align="center">10.59</td><td align="center">25.78</td><td align="center">67.95</td><td align="center">88.46</td>
</tr>
<tr>
<td align="left">Kimi-Audio</td><td align="center">7000</td><td align="center">68.65</td><td align="center">24.34</td><td align="center">40.90</td><td align="center">38.72</td><td align="center">30.72</td><td align="center">44.29</td><td align="center">45.54</td><td align="center">50.86</td><td align="center">33.49</td>
</tr>
<tr>
<td align="left">FireRedASR-LLM-L</td><td align="center">8300</td><td align="center">73.70</td><td align="center">18.72</td><td align="center">43.93</td><td align="center">43.33</td><td align="center">34.53</td><td align="center">48.05</td><td align="center">49.99</td><td align="center">49.87</td><td align="center">45.92</td>
</tr>
<tr>
<td align="left"><b>Conformer-LLM-Yue</b></td><td align="center">4200</td><td align="center"><ins>17.22</ins></td><td align="center"><b>6.21</b></td><td align="center"><b>6.23</b></td><td align="center">9.52</td><td align="center"><b>4.35</b></td><td align="center"><b>4.57</b></td><td align="center">6.98</td><td align="center"><b>4.73</b></td><td align="center"><b>7.91</b></td>
</tr>
</table>
## ASR Inference
### U2pp_Conformer_Yue
```
dir=u2pp_conformer_yue
decode_checkpoint=$dir/u2pp_conformer_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir
python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention_rescoring \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --ctc_weight 0.5 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1
```
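The command reads `$test_set/data.list`. Below is a minimal sketch for building that file, assuming WeNet's raw data format, in which each line is a JSON object with `key`, `wav`, and `txt` fields. The helper name `write_data_list` and the example paths are hypothetical; check the dataset settings in `train.yaml` and the WeNet documentation before relying on this layout.

```python
import json


def write_data_list(samples, out_path):
    """Write one JSON object per line: utterance key, wav path, reference text."""
    with open(out_path, "w", encoding="utf-8") as f:
        for key, wav, txt in samples:
            # ensure_ascii=False keeps Cantonese characters readable in the file.
            line = json.dumps({"key": key, "wav": wav, "txt": txt}, ensure_ascii=False)
            f.write(line + "\n")


if __name__ == "__main__":
    # Hypothetical utterances; replace with real audio paths and transcripts.
    samples = [
        ("utt_0001", "path/to/test_set/wav/utt_0001.wav", "你好"),
        ("utt_0002", "path/to/test_set/wav/utt_0002.wav", "早晨"),
    ]
    write_data_list(samples, "data.list")
```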
### Whisper_Medium_Yue
```
dir=whisper_medium_yue
decode_checkpoint=$dir/whisper_medium_yue.pt
test_set=path/to/test_set
test_result_dir=path/to/test_result_dir
python wenet/bin/recognize.py \
  --gpu 0 \
  --modes attention \
  --config $dir/train.yaml \
  --test_data $test_set/data.list \
  --checkpoint $decode_checkpoint \
  --beam_size 10 \
  --batch_size 32 \
  --blank_penalty 0.0 \
  --ctc_weight 0.0 \
  --reverse_weight 0.0 \
  --result_dir $test_result_dir \
  --decoding_chunk_size -1
```
### SenseVoice_Small_Yue
```
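from funasr import AutoModel

# Local directory holding config.yaml, configuration.json, and model.pt
model_dir = "sensevoice_small_yue"
# Placeholder: path to the audio file to transcribe
wav_path = "path/to/audio.wav"

model = AutoModel(
    model=model_dir,
    device="cuda:0",
)

res = model.generate(
    wav_path,
    cache={},
    language="yue",
    use_itn=True,
    batch_size=64,
)
print(res)
```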
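SenseVoice-style models typically prefix the transcript with special tokens of the form `<|...|>` (language, emotion, event type). Below is a minimal sketch for removing them with a regular expression, assuming that tag format; the helper name `strip_tags` and the sample string are illustrative, not real model output. FunASR also provides `rich_transcription_postprocess` in `funasr.utils.postprocess_utils`, which may be preferable in practice.

```python
import re

# Matches special tokens such as <|yue|> or <|NEUTRAL|>.
TAG_PATTERN = re.compile(r"<\|[^|]*\|>")


def strip_tags(text):
    """Remove <|...|> tokens and surrounding whitespace from a transcript."""
    return TAG_PATTERN.sub("", text).strip()


if __name__ == "__main__":
    # Illustrative string, not real model output.
    sample = "<|yue|><|NEUTRAL|><|Speech|><|withitn|>你好"
    print(strip_tags(sample))
```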