Needed something small for batched short-form transcription. This is a low-effort distillation of v0.3 into base, with a tokenizer swap and 128 mel bins. Not recommended for long-form audio: the deletion rate (dr) got worse while the substitution and insertion rates (sr/ir) improved, so WER is slightly better but the model skips too much text. On short-form audio it lands between small and medium and is a little better than moonshine; see the table below.
Benchmarks
See v0.3 for more results.
cv8-filtered fixes the leakage of cv8 with respect to cv20 (2178/4483).
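One way such a leakage filter could be built is to drop every test utterance whose transcript also occurs in the training data. The sketch below assumes exact transcript matching after whitespace stripping and a hypothetical `sentence` field on each row; the function name `filter_leaked` is also hypothetical, and the card does not state how cv8-filtered was actually produced.

```python
from typing import Dict, Iterable, List


def filter_leaked(
    test_rows: Iterable[Dict[str, str]],
    train_sentences: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep only test rows whose transcript never appears in the train set.

    Assumption: leakage is detected by exact transcript match, which may
    differ from the criterion used for cv8-filtered.
    """
    seen = {sentence.strip() for sentence in train_sentences}
    return [row for row in test_rows if row["sentence"].strip() not in seen]


test_rows = [
    {"id": "a", "sentence": "おはようございます"},
    {"id": "b", "sentence": "今日はいい天気ですね"},
]
train_sentences = ["おはようございます ", "ありがとう"]
kept = filter_leaked(test_rows, train_sentences)
print([row["id"] for row in kept])
```

Matching on text rather than on clip IDs is a deliberate simplification here, since the same sentence can be re-recorded across Common Voice releases under different IDs.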
| model | fleurs cer | fleurs sr | fleurs ir | fleurs dr | cv8 cer | cv8 sr | cv8 ir | cv8 dr | cv8-filtered cer | cv8-filtered sr | cv8-filtered ir | cv8-filtered dr | jsut-basic cer | jsut-basic sr | jsut-basic ir | jsut-basic dr | reazon cer | reazon sr | reazon ir | reazon dr | bluearchive cer | bluearchive sr | bluearchive ir | bluearchive dr | nekopara cer | nekopara sr | nekopara ir | nekopara dr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tiny_b5_n5 (37.8M) | 34.872 | 25.283 | 7.106 | 2.482 | 34.412 | 24.491 | 7.392 | 2.529 | 36.807 | 25.709 | 8.525 | 2.573 | 31.475 | 23.028 | 6.754 | 1.693 | 53.727 | 26.712 | 20.666 | 6.348 | 32.597 | 17.671 | 12.559 | 2.366 | 60.719 | 32.518 | 21.223 | 6.978 | ||||||
| tiny_b5_n5_nt | 34.645 | 25.366 | 6.818 | 2.461 | 33.071 | 23.777 | 6.459 | 2.835 | 34.886 | 24.820 | 7.166 | 2.900 | 31.048 | 22.804 | 6.418 | 1.826 | 46.508 | 24.779 | 13.174 | 8.555 | 26.572 | 16.250 | 7.372 | 2.950 | 51.144 | 29.551 | 11.354 | 10.239 | ||||||
| base_b5_n5 (73.6M) | 22.811 | 16.903 | 3.855 | 2.053 | 23.723 | 17.283 | 4.268 | 2.173 | 25.653 | 18.559 | 4.760 | 2.334 | 22.426 | 16.994 | 4.012 | 1.420 | 40.646 | 18.565 | 16.556 | 5.526 | 17.958 | 11.096 | 4.815 | 2.046 | 44.135 | 26.304 | 10.779 | 7.052 | ||||||
| base_b5_n5_nt | 22.758 | 16.854 | 3.784 | 2.120 | 22.991 | 16.836 | 3.728 | 2.426 | 24.669 | 17.882 | 4.212 | 2.575 | 22.338 | 16.910 | 3.952 | 1.476 | 35.135 | 17.636 | 11.435 | 6.065 | 16.536 | 10.361 | 3.516 | 2.658 | 40.438 | 24.289 | 6.848 | 9.301 | ||||||
| small_b5_n5 (242M) | 12.177 | 9.052 | 1.707 | 1.419 | 14.460 | 10.320 | 2.276 | 1.865 | 15.731 | 11.007 | 2.758 | 1.966 | 13.856 | 10.593 | 2.114 | 1.149 | 28.569 | 11.154 | 13.162 | 4.253 | 10.923 | 6.613 | 2.718 | 1.593 | 31.519 | 18.896 | 5.427 | 7.196 | ||||||
| small_b5_n5_nt | 12.061 | 8.997 | 1.587 | 1.477 | 14.099 | 10.214 | 1.906 | 1.978 | 15.088 | 10.836 | 2.153 | 2.099 | 13.765 | 10.541 | 2.032 | 1.192 | 25.462 | 10.836 | 9.391 | 5.235 | 10.038 | 6.103 | 1.949 | 1.986 | 29.738 | 17.271 | 3.001 | 9.466 | ||||||
| medium_b5_n5 (764M) | 7.259 | 5.218 | 0.959 | 1.082 | 10.925 | 7.694 | 1.455 | 1.776 | 11.869 | 8.344 | 1.555 | 1.970 | 9.628 | 7.200 | 1.428 | 1.000 | 25.426 | 7.871 | 13.965 | 3.590 | 8.756 | 4.996 | 2.415 | 1.345 | 29.385 | 16.406 | 5.659 | 7.320 | ||||||
| medium_b5_n5_nt | 7.244 | 5.240 | 0.910 | 1.094 | 10.647 | 7.506 | 1.282 | 1.859 | 11.474 | 8.055 | 1.332 | 2.088 | 9.598 | 7.200 | 1.349 | 1.048 | 21.381 | 7.791 | 9.150 | 4.440 | 8.306 | 4.700 | 1.684 | 1.922 | 27.300 | 14.658 | 2.833 | 9.808 | ||||||
| moonshine_b1 (27M) | 13.149 | 9.129 | 1.872 | 2.148 | 9.685* | 6.693* | 1.375* | 1.618* | 11.664* | 8.046* | 1.751* | 1.867* | 10.570 | 7.694 | 1.901 | 0.975 | 10.410* | 4.786* | 2.745* | 2.879* | 17.938 | 7.170 | 2.564 | 8.204 | 43.224 | 17.754 | 6.988 | 18.482 | ||||||
| moonshine_b5 | 11.280 | 7.786 | 1.646 | 1.848 | 8.575* | 5.916* | 1.158* | 1.501* | 10.281* | 7.064* | 1.514* | 1.702* | 9.876 | 7.143 | 1.827 | 0.906 | 9.561* | 4.218* | 2.631* | 2.713* | 16.956 | 6.788 | 2.496 | 7.671 | 44.377 | 17.648 | 9.406 | 17.323 | ||||||
| moonshine_b5_n5 | 12.839 | 8.825 | 1.927 | 2.087 | 9.037* | 6.230* | 1.205* | 1.603* | 10.745* | 7.374* | 1.553* | 1.819* | 10.184 | 7.409 | 1.849 | 0.926 | 10.301* | 4.614* | 2.697* | 2.989* | 16.585 | 6.859 | 1.937 | 7.788 | 38.240 | 16.535 | 3.036 | 18.668 | ||||||
| base-ja_b5 (56.6M) | 11.102 | 7.701 | 1.480 | 1.921 | 10.704* | 7.403* | 1.639* | 1.661* | 13.084 | 8.890 | 2.237 | 1.957 | 9.283 | 6.964 | 1.233 | 1.087 | 18.041 | 7.054 | 7.272 | 3.715 | 8.586 | 4.969 | 2.055 | 1.562 | 22.512 | 12.687 | 3.312 | 6.513 | ||||||
| base-ja_b5_n5 | 11.145 | 7.734 | 1.480 | 1.931 | 10.692* | 7.407* | 1.622* | 1.664* | 13.057 | 8.897 | 2.201 | 1.959 | 9.292 | 6.972 | 1.231 | 1.088 | 17.929 | 7.049 | 7.159 | 3.721 | 8.591 | 4.978 | 2.058 | 1.555 | 22.491 | 12.656 | 3.295 | 6.541 | ||||||
| base-ja_b5_nt | 10.379 | 7.354 | 1.474 | 1.551 | 10.668* | 7.391* | 1.598* | 1.679* | 13.054 | 8.911 | 2.160 | 1.984 | 9.259 | 6.959 | 1.223 | 1.077 | 14.905 | 6.257 | 4.711 | 3.938 | 8.338 | 4.810 | 1.903 | 1.625 | 22.226 | 12.473 | 2.946 | 6.806 | ||||||
| base-ja_b5_n5_nt | 10.416 | 7.385 | 1.474 | 1.557 | 10.672* | 7.387* | 1.597* | 1.687* | 13.061 | 8.906 | 2.158 | 1.997 | 9.272 | 6.971 | 1.223 | 1.079 | 14.931 | 6.254 | 4.733 | 3.944 | 8.340 | 4.813 | 1.902 | 1.625 | 22.216 | 12.466 | 2.937 | 6.813 |
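The cer, sr, ir and dr columns are consistent with a character error rate split into substitution, insertion and deletion rates: in each row, sr + ir + dr adds up to cer within rounding. A minimal sketch of how such a breakdown can be computed follows. It assumes plain Levenshtein alignment over characters, no text normalization, and edit counts pooled over all utterances before dividing by the total reference length; the names `count_edits` and `cer_breakdown` are hypothetical, and this is not necessarily the evaluation script behind the table.

```python
from typing import Iterable, NamedTuple, Tuple


class ErrorRates(NamedTuple):
    cer: float  # total character error rate, percent of reference characters
    sr: float   # substitution rate
    ir: float   # insertion rate
    dr: float   # deletion rate


def count_edits(reference: str, hypothesis: str) -> Tuple[int, int, int]:
    """Return (substitutions, insertions, deletions) of one minimal alignment."""
    # Each cell holds (total_edits, subs, ins, dels) for
    # reference[:i] versus hypothesis[:j]; only two rows are kept in memory.
    prev = [(j, 0, j, 0) for j in range(len(hypothesis) + 1)]
    for i, ref_ch in enumerate(reference, start=1):
        cur = [(i, 0, 0, i)]
        for j, hyp_ch in enumerate(hypothesis, start=1):
            t, s, a, d = prev[j - 1]
            if ref_ch == hyp_ch:
                diag = (t, s, a, d)          # match, no cost
            else:
                diag = (t + 1, s + 1, a, d)  # substitution
            t, s, a, d = cur[j - 1]
            ins = (t + 1, s, a + 1, d)       # extra character in the hypothesis
            t, s, a, d = prev[j]
            dele = (t + 1, s, a, d + 1)      # reference character skipped
            cur.append(min((diag, ins, dele), key=lambda cell: cell[0]))
        prev = cur
    _, subs, inserts, dels = prev[-1]
    return subs, inserts, dels


def cer_breakdown(pairs: Iterable[Tuple[str, str]]) -> ErrorRates:
    """Pool edit counts over (reference, hypothesis) pairs and report percentages."""
    subs = inserts = dels = ref_chars = 0
    for reference, hypothesis in pairs:
        s, i, d = count_edits(reference, hypothesis)
        subs, inserts, dels = subs + s, inserts + i, dels + d
        ref_chars += len(reference)
    if ref_chars == 0:
        raise ValueError("references contain no characters")
    scale = 100.0 / ref_chars
    return ErrorRates(
        cer=(subs + inserts + dels) * scale,
        sr=subs * scale,
        ir=inserts * scale,
        dr=dels * scale,
    )


rates = cer_breakdown([("abcd", "abxd"), ("abcd", "abd")])
print(rates)
```

Under this definition, a model that skips text shows up as a high dr even when sr and ir are low, which is the pattern to watch for in the dr columns.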
Acknowledgements
- Train sets: OOPPEENN, Reazon, 小虫哥_, Common Voice 20, deepghs
- Test sets: kotoba-tech, Saruwatari-lab, grider-withourai