MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
This is a mirror of the original weights for use with TTSDB.
Original weights: https://huggingface.co/amphion/MaskGCT
Original code: https://github.com/open-mmlab/Amphion
MaskGCT is a non-autoregressive masked-transformer text-to-speech model supporting English, Chinese, Korean, Japanese, French, and German.
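To give a feel for what "non-autoregressive masked transformer" means, here is a toy sketch of masked generative (iterative parallel) decoding: start from a fully masked token sequence, predict all masked positions at once, keep the most confident predictions, and re-mask the rest under a cosine schedule. The `toy_predict` stub and all names here are illustrative assumptions, not the MaskGCT implementation (the real model conditions on text and a reference prompt and predicts semantic, then acoustic, codec tokens).

```python
import math
import random

MASK = -1

def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the transformer: propose a token and a confidence
    for every masked position. (Random here; the real model is learned.)"""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }

def masked_generative_decode(length, vocab_size, steps=8, seed=0):
    """Iterative parallel decoding with a cosine unmasking schedule,
    the decoding style used by masked generative transformers."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * step / steps)
        n_keep_masked = int(length * mask_ratio)
        proposals = toy_predict(tokens, vocab_size, rng)
        # Accept the most confident proposals; the rest stay masked.
        ranked = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        n_accept = max(1, len(proposals) - n_keep_masked)
        for i, (tok, _conf) in ranked[:n_accept]:
            tokens[i] = tok
    return tokens

codes = masked_generative_decode(length=16, vocab_size=1024)
```

Because many positions are filled per step, the whole sequence is generated in a small, fixed number of forward passes instead of one pass per token as in autoregressive decoding.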
This model was created by the original authors. Please cite their work if you use this model:
```bibtex
@article{wang2024maskgct,
  title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
  author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
  journal={arXiv preprint arXiv:2409.00750},
  year={2024}
}

@inproceedings{amphion,
  author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and Li, Jiaqi and He, Haorui and Wang, Chaoren and Song, Ting and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
  title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
  booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
  year={2024}
}
```
Installation:

```bash
pip install ttsdb-maskgct
```
Usage:

```python
from ttsdb_maskgct import MaskGCT

# Load the model (downloads weights automatically)
model = MaskGCT(model_id="ttsds/MaskGCT")

# Synthesize speech
audio, sample_rate = model.synthesize(
    text="Hello, this is a test of MaskGCT.",
    reference_audio="path/to/reference.wav",
    text_reference="Transcript of the reference audio.",
    language="en",
)

# Save the output
model.save_audio(audio, sample_rate, "output.wav")
```
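If you prefer not to rely on `model.save_audio`, and assuming `synthesize` returns a mono float waveform with values in [-1, 1] (an assumption, not documented behavior), the output can be written as 16-bit PCM with only the Python standard library:

```python
import math
import struct
import wave

def write_wav(path, samples, sample_rate=24000):
    """Write a mono float waveform (values in [-1, 1]) as 16-bit PCM WAV.
    24000 Hz matches the model's output sample rate."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        pcm = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        f.writeframes(pcm)

# Example input: a 100 ms, 440 Hz sine tone at the model's 24 kHz rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(2400)]
write_wav("tone.wav", tone)
```

The clamp before packing guards against the occasional sample that slightly exceeds the [-1, 1] range.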
| Property | Value |
|---|---|
| Sample Rate | 24000 Hz |
| Parameters | 1010M |
| Architecture | Non-Autoregressive Masked Transformer |
| Languages | English, Chinese, Korean, Japanese, French, German |
| Release Date | 2024-10-17 |
Please refer to the original repositories for full license terms.