---
license: apache-2.0
language:
- en
- de
- ar
- zh
- es
- ko
pipeline_tag: text-to-speech
library_name: transformers
base_model:
- nineninesix/kani-tts-450m-0.2-pt
---
# KaniTTS
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
A high-speed, high-fidelity Text-to-Speech model optimized for real-time conversational AI applications.
## Overview
KaniTTS uses a two-stage pipeline that pairs a large language model (LLM) backbone with an efficient neural audio codec. The backbone LLM generates compressed audio token representations, and the codec then synthesizes the waveform from those tokens, which keeps latency low while preserving audio quality.
**Key Specifications:**
- **Model Size:** 370M parameters
- **Sample Rate:** 22 kHz
- **Languages:** English, German, Chinese, Korean, Arabic, Spanish
- **License:** Apache 2.0
## Performance
**Nvidia RTX 5080 Benchmarks:**
- **Latency:** ~1 second to generate 15 seconds of audio
- **Memory:** 2 GB of GPU VRAM
- **Quality Metrics:** MOS 4.3/5 (naturalness), WER <5% (accuracy)
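
The latency figure above can be restated as a real-time factor (RTF), the ratio of generation time to audio duration. The sketch below uses only the approximate numbers quoted in this section (about 1 second of compute for 15 seconds of audio); the function name is illustrative and not part of any KaniTTS API.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Return generation time divided by audio duration (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Approximate figures from the RTX 5080 benchmark above.
rtf = real_time_factor(1.0, 15.0)
print(f"RTF ~ {rtf:.3f}")                      # about 0.067
print(f"speed-up ~ {1 / rtf:.0f}x real time")  # about 15x
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the usual requirement for real-time conversational use.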
**Pretraining:**
- **Dataset:** ~80k hours from LibriTTS, Common Voice, and Emilia
- **Hardware:** 8x H100 GPUs, 45 hours training time on [Lambda AI](https://lambda.ai/)
**Voice Datasets:**
- [https://huggingface.co/datasets/nytopop/expresso-conversational](https://huggingface.co/datasets/nytopop/expresso-conversational)
- [https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech](https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech)
- [https://huggingface.co/datasets/jazza234234/david-dataset](https://huggingface.co/datasets/jazza234234/david-dataset)
- [https://huggingface.co/datasets/reach-vb/jenny_tts_dataset](https://huggingface.co/datasets/reach-vb/jenny_tts_dataset)
- [https://huggingface.co/datasets/MBZUAI/ArVoice](https://huggingface.co/datasets/MBZUAI/ArVoice)
- [https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full)
- [https://huggingface.co/datasets/SinclairSchneider/german_voice_cb](https://huggingface.co/datasets/SinclairSchneider/german_voice_cb)
- [https://huggingface.co/datasets/Bingsu/KSS_Dataset](https://huggingface.co/datasets/Bingsu/KSS_Dataset)
- [https://huggingface.co/datasets/ciempiess/ciempiess_fem](https://huggingface.co/datasets/ciempiess/ciempiess_fem)
- [https://huggingface.co/datasets/TingChen-ppmc/Shanghai_Dialect_TTS_openai](https://huggingface.co/datasets/TingChen-ppmc/Shanghai_Dialect_TTS_openai)
- [https://huggingface.co/datasets/boniromou/zh-yue-tts-dataset](https://huggingface.co/datasets/boniromou/zh-yue-tts-dataset)
- [https://huggingface.co/datasets/zeeshanparvez/andrew-v3](https://huggingface.co/datasets/zeeshanparvez/andrew-v3)
**Voices:**
- `david` — David, English (British)
- `puck` — Puck, English (Gemini)
- `kore` — Kore, English (Gemini)
- `andrew` — Andrew, English
- `jenny` — Jenny, English (Irish)
- `simon` — Simon, English
- `katie` — Katie, English
- `seulgi` — Seulgi, Korean
- `bert` — Bert, German
- `thorsten` — Thorsten, German (Hessisch)
- `maria` — Maria, Spanish
- `mei` — Mei, Chinese (Cantonese)
- `ming` — Ming, Chinese (Shanghai OpenAI)
- `karim` — Karim, Arabic
- `nur` — Nur, Arabic
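
The voice identifiers above can be kept in a small lookup table, for example to pick a voice for a given language. This is only a sketch: the dictionary mirrors the list above, while `VOICES` and `voices_for` are illustrative names and not part of the KaniTTS API. How a voice identifier is passed to the model is not shown here.

```python
# Voice identifier -> language, copied from the list above.
VOICES = {
    "david": "English", "puck": "English", "kore": "English",
    "andrew": "English", "jenny": "English", "simon": "English",
    "katie": "English", "seulgi": "Korean", "bert": "German",
    "thorsten": "German", "maria": "Spanish", "mei": "Chinese",
    "ming": "Chinese", "karim": "Arabic", "nur": "Arabic",
}


def voices_for(language: str) -> list[str]:
    """Return the sorted voice identifiers listed for a language."""
    wanted = language.strip().lower()
    return sorted(v for v, lang in VOICES.items() if lang.lower() == wanted)


print(voices_for("German"))  # ['bert', 'thorsten']
print(voices_for("Arabic"))  # ['karim', 'nur']
```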
## Audio Examples
| Text | Audio |
|---|---|
| I do believe Marsellus Wallace, MY husband, YOUR boss, told you to take me out and do WHATEVER I WANTED. | |
| What do we say to the god of death? Not today! | |
| What do you call a lawyer with an IQ of 60? Your honor | |
| You mean, let me understand this cause, you know maybe it's me, it's a little fucked up maybe, but I'm funny how, I mean funny like I'm a clown, I amuse you? | |
## Use Cases
- **Conversational AI:** Real-time speech for chatbots and virtual assistants
- **Edge/Server Deployment:** Resource-efficient inference on affordable hardware
- **Accessibility:** Screen readers and language learning applications
- **Research:** Fine-tuning for specific voices, accents, or emotions
## Limitations
- Performance degrades with inputs exceeding 2000 tokens
- Limited expressivity without fine-tuning for specific emotions
- May inherit biases from training data in prosody or pronunciation
- Optimized primarily for English; other languages may require additional training
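
One way to work around the input-length limitation is to split long text at sentence boundaries and synthesize each piece separately. The sketch below is an assumption-laden illustration: it approximates token counts by whitespace-separated words, whereas the model's own tokenizer will give different counts, and this card does not specify exactly which tokens count toward the 2000-token limit. `split_into_chunks` and `count_tokens` are hypothetical names.

```python
import re


def split_into_chunks(text, max_tokens=2000, count_tokens=None):
    """Greedily pack whole sentences into chunks under a token budget."""
    # Assumption: word count stands in for the real tokenizer's count.
    count_tokens = count_tokens or (lambda s: len(s.split()))
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk if adding this sentence would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        # Note: a single sentence longer than the budget is kept whole.
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = split_into_chunks("One two three. Four five six! Seven eight?", max_tokens=6)
print(demo)  # ['One two three. Four five six!', 'Seven eight?']
```

Passing the real tokenizer's counting function as `count_tokens` would make the budget match the model more closely.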
## Optimization Tips
- **Multilingual Performance:** Continue pretraining on target-language datasets and fine-tune NanoCodec
- **Batch Processing:** Use batches of 8-16 for high-throughput scenarios
- **Hardware:** Optimized for NVIDIA Blackwell architecture GPUs
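
The batch-size suggestion above can be applied by grouping input texts before inference. The helper below is a minimal sketch of that grouping step only; `batched` is an illustrative name, and how each batch is passed to the model depends on the inference code in use and is not shown.

```python
from typing import Iterator, Sequence


def batched(items: Sequence[str], batch_size: int = 8) -> Iterator[list]:
    """Yield consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = [f"Sentence number {i}." for i in range(20)]
sizes = [len(batch) for batch in batched(texts, batch_size=8)]
print(sizes)  # [8, 8, 4]
```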
## Resources
**Models:**
- [Pretrained Model](https://huggingface.co/nineninesix/kani-tts-450m-0.2-pt)
- [Fine-tuned Model](https://huggingface.co/nineninesix/kani-tts-370m)
- [HuggingFace Space](https://huggingface.co/spaces/nineninesix/KaniTTS)
**Examples:**
- [Inference Example](https://colab.research.google.com/drive/1mvzGs7jtAMSUz8wvNlL5uFmgFEyAPjDh?usp=sharing)
- [Fine-tuning Example](https://colab.research.google.com/drive/1oDIPOSHW2kUoP3CGafvh9lM6j03Z-vE6?usp=sharing)
- [Example Dataset](https://huggingface.co/datasets/nineninesix/expresso-conversational-en-nano-codec-dataset)
- [GitHub Repository](https://github.com/nineninesix-ai/kani-tts)
**Links:**
- [Website](https://www.nineninesix.ai/)
- [Contact Form](https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form)
## Acknowledgments
Built on top of [LiquidAI LFM2 350M](https://huggingface.co/LiquidAI/LFM2-350M) as the backbone and [Nvidia NanoCodec](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) for audio processing.
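
The codec name linked above encodes a bitrate (0.6 kbps) and a frame rate (12.5 fps), which gives a rough sense of how compact the token stream is. The arithmetic below is a back-of-the-envelope sketch that assumes those two figures are exact and that 1 kbps means 1000 bits per second; it says nothing about the codec's internal codebook layout.

```python
BITRATE_BPS = 600        # 0.6 kbps, taken from the codec name (assumed exact)
FRAME_RATE_FPS = 12.5    # frames per second, taken from the codec name

bits_per_frame = BITRATE_BPS / FRAME_RATE_FPS

# Example: the 15-second clip used in the latency benchmark.
audio_seconds = 15
frames = audio_seconds * FRAME_RATE_FPS
total_bits = audio_seconds * BITRATE_BPS

print(f"{bits_per_frame:.0f} bits per codec frame")    # 48
print(f"{frames} frames for {audio_seconds} s")        # 187.5
print(f"{total_bits / 8:.0f} bytes of codec payload")  # 1125
```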
## Responsible Use
**Prohibited activities include:**
- Creating illegal content or harmful, threatening, defamatory, or obscene material
- Producing hate speech, harassment, or incitement of violence
- Generating false or misleading information
- Impersonating individuals without consent
- Malicious activities such as spamming, phishing, or fraud
By using this model, you agree to comply with these restrictions and all applicable laws.
## Contact
Have a question, feedback, or need support? Please fill out our [contact form](https://airtable.com/appX2G2TpoRk4M5Bf/pagO2xbIOjiwulPcP/form) and we'll get back to you as soon as possible.
## Citation
```
@misc {sb_2025,
author = { SB },
title = { gemini-flash-2.0-speech },
year = 2025,
url = { https://huggingface.co/datasets/shb777/gemini-flash-2.0-speech },
doi = { 10.57967/hf/4237 },
publisher = { Hugging Face }
}
```
```
@misc{toyin2025arvoicemultispeakerdatasetarabic,
title={ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis},
author={Hawau Olamide Toyin and Rufael Marew and Humaid Alblooshi and Samar M. Magdy and Hanan Aldarmaki},
year={2025},
eprint={2505.20506},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.20506},
}
```
```
@misc {thorsten_müller_2024,
author = { {Thorsten Müller} },
title = { TV-44kHz-Full (Revision ff427ec) },
year = 2024,
url = { https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full },
doi = { 10.57967/hf/3290 },
publisher = { Hugging Face }
}
```
```
@misc{carlosmenaciempiessfem2019,
title={CIEMPIESS FEM CORPUS: Audio and Transcripts of Female Speakers in Spanish.},
ldc_catalog_no={LDC2019S07},
DOI={https://doi.org/10.35111/xdx5-n815},
author={Hernandez Mena, Carlos Daniel},
journal={Linguistic Data Consortium, Philadelphia},
year={2019},
url={https://catalog.ldc.upenn.edu/LDC2019S07},
}
```