# MOSS-TTS Family

## Overview
MOSS‑TTS Family is an open‑source speech and sound generation model family from MOSI.AI and the OpenMOSS team. It is built for high fidelity, high expressiveness, and complex real‑world scenarios, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.
## Introduction
When a single piece of audio needs to sound like a real person, pronounce every word accurately, switch speaking styles across content, remain stable over tens of minutes, and support dialogue, role‑play, and real‑time interaction, a single TTS model is often not enough. The MOSS‑TTS Family breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.
- MOSS‑TTS: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation control, and multilingual/code-switched speech. It serves as the core engine for scalable narration, dubbing, and voice-driven products.
- MOSS‑TTSD: A production long-form dialogue model for expressive multi-speaker conversational audio at scale. It supports long-duration continuity, turn-taking control, and zero-shot voice cloning from short references for podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- MOSS‑VoiceGenerator: An open-source voice design model that creates speaker timbres directly from free-form text, without reference audio. It unifies timbre design, style control, and content synthesis, and can be used standalone or as a voice-design layer for downstream TTS.
- MOSS‑SoundEffect: A high-fidelity text-to-sound model with broad category coverage and controllable duration for real content production. It generates stable audio from prompts across ambience, urban scenes, creatures, human actions, and music-like clips for film, games, interactive media, and data synthesis.
- MOSS‑TTS‑Realtime: A context-aware, multi-turn streaming TTS model for real-time voice agents. By conditioning on dialogue history across both text and prior user acoustics, it delivers low-latency synthesis with coherent, voice-consistent responses across turns.
## Released Models
| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---|---|---|
| MOSS-TTS | MossTTSDelay | 8B | moss_tts_model_card.md | 🤗 Huggingface |
| MOSS-TTS | MossTTSLocal | 1.7B | moss_tts_model_card.md | 🤗 Huggingface |
| MOSS‑TTSD‑V1.0 | MossTTSDelay | 8B | moss_ttsd_model_card.md | 🤗 Huggingface |
| MOSS‑VoiceGenerator | MossTTSDelay | 1.7B | moss_voice_generator_model_card.md | 🤗 Huggingface |
| MOSS‑SoundEffect | MossTTSDelay | 8B | moss_sound_effect_model_card.md | 🤗 Huggingface |
| MOSS‑TTS‑Realtime | MossTTSRealtime | 1.7B | moss_tts_realtime_model_card.md | 🤗 Huggingface |
# MOSS-TTS-Realtime

## 1. Overview

### 1.1 TTS Family Positioning
MOSS-TTS-Realtime is a high-performance, real-time speech synthesis model within the broader MOSS TTS Family. It is designed for interactive voice agents that require low-latency, continuous speech generation across multi-turn conversations. Unlike conventional streaming TTS systems that synthesize each response in isolation, MOSS-TTS-Realtime natively models dialogue context by conditioning speech generation on both textual and acoustic information from previous turns. By tightly integrating multi-turn context awareness with incremental streaming synthesis, it produces natural, coherent, and voice-consistent audio responses, enabling fluid and human-like spoken interactions for real-time applications.
**Key Capabilities**
- **Context-Aware & Expressive Speech Generation:** Generates expressive, coherent speech by modeling both textual and acoustic context across multiple dialogue turns.
- **High-Fidelity Voice Cloning with Multi-Turn Consistency:** Achieves high voice similarity while maintaining a stable speaker identity across dialogue turns.
- **Long Context:** Supports a maximum context length of 32K tokens (about 40 minutes of audio), enabling stable and consistent speech generation in extended conversations.
- **Highly Human-Like Speech with Natural Prosody:** Trained on over 2.5 million hours of single-speaker speech and more than 1 million hours of two-speaker and multi-speaker conversational data, yielding natural prosody and strong human-like expressiveness.
- **Multilingual Speech Support:** Supports more than 10 languages beyond Chinese and English, including Korean, Japanese, German, and French, enabling consistent and expressive speech across languages.
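To make the multi-turn conditioning concrete, the sketch below shows how a client might accumulate dialogue history (text plus prior audio) and pass it into each streaming request. This is an illustrative sketch only: the class and method names (`Turn`, `DialogueContext`, `model.stream`) are hypothetical, not the MOSS-TTS-Realtime API; see Section 2.2 for the real interface.

```python
# Illustrative sketch of context-conditioned streaming synthesis.
# All names below (Turn, DialogueContext, model.stream) are hypothetical;
# the actual API is documented in docs/moss_tts_realtime_model_card.md.
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Turn:
    role: str                      # "user" or "assistant"
    text: str                      # transcript of the turn
    audio: Optional[bytes] = None  # raw audio of the turn, if available


@dataclass
class DialogueContext:
    turns: list = field(default_factory=list)

    def add(self, turn: Turn) -> None:
        self.turns.append(turn)


def speak_reply(model, context: DialogueContext, reply_text: str) -> Iterator[bytes]:
    """Stream audio chunks for reply_text, conditioned on the whole history.

    Because the model sees prior text *and* prior user acoustics, prosody
    and speaker identity stay consistent from turn to turn.
    """
    for chunk in model.stream(context=context, text=reply_text):  # hypothetical call
        yield chunk
    context.add(Turn(role="assistant", text=reply_text))
```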
### 1.2 Model Architecture

## 2. Quick Start

### 2.1 Environment Setup
We recommend a clean, isolated Python environment with Transformers 5.0.0 to avoid dependency conflicts.
```bash
conda create -n moss-tts python=3.12 -y
conda activate moss-tts
```
Install all required dependencies:
```bash
git clone https://github.com/OpenMOSS/MOSS-TTS.git
cd MOSS-TTS
pip install --extra-index-url https://download.pytorch.org/whl/cu128 -e .
```
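After installation, a short sanity check (run inside the `moss-tts` environment) confirms that PyTorch sees the GPU and that the expected Transformers version is in place; the version expectation is taken from the note above:

```python
# Sanity check for the freshly created environment.
import torch
import transformers

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # expects the cu128 wheel above
print("transformers:", transformers.__version__)     # expected: 5.0.0 (see note above)
```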
### 2.2 Usage
Please refer to the model card in the GitHub repository for detailed usage instructions and examples:

👉 Usage Guide: https://github.com/OpenMOSS/MOSS-TTS/blob/main/docs/moss_tts_realtime_model_card.md
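For orientation before reading the model card, the following shows what loading a custom-architecture checkpoint from the Hugging Face Hub typically looks like. The repository id and the use of `trust_remote_code` are assumptions, not confirmed by this page; follow the linked guide for the actual entry points.

```python
# Hypothetical loading sketch; the real entry points are in the usage guide above.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "OpenMOSS/MOSS-TTS-Realtime",  # assumed Hub repo id (see the 🤗 links above)
    trust_remote_code=True,        # MossTTSRealtime is a custom architecture
)
```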