MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Abstract
MiniMax-Speech, an autoregressive Transformer-based TTS model, generates high-quality speech with a learnable speaker encoder that extracts reference speaker features without transcription, achieving SOTA results in voice cloning and supporting various extensions.
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
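To make the conditioning idea concrete, here is a minimal NumPy sketch of how a learnable speaker encoder can inject timbre into an autoregressive decoder without any transcription of the reference audio: reference mel frames are projected and mean-pooled into a fixed-size timbre vector, which is prepended to the text embeddings as a conditioning token. All names, dimensions, and the pooling/prepending scheme are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
N_MELS, D_MODEL, VOCAB = 80, 64, 100

# "Speaker encoder": a learnable projection over reference mel frames,
# mean-pooled into one fixed-size timbre embedding. Only acoustic frames
# are consumed -- no transcript of the reference audio is needed.
W_spk = rng.standard_normal((N_MELS, D_MODEL)) * 0.02

def speaker_embedding(ref_mels: np.ndarray) -> np.ndarray:
    """ref_mels: (T, N_MELS) frames of the reference audio."""
    frame_feats = ref_mels @ W_spk          # (T, D_MODEL)
    return frame_feats.mean(axis=0)         # (D_MODEL,) timbre vector

# Text-side embedding table for the autoregressive decoder.
E_text = rng.standard_normal((VOCAB, D_MODEL)) * 0.02

def decoder_inputs(token_ids: np.ndarray, timbre: np.ndarray) -> np.ndarray:
    """Prepend the timbre vector as a conditioning token -- one common way
    to give an AR Transformer a speaker identity signal."""
    tok = E_text[token_ids]                                 # (L, D_MODEL)
    return np.concatenate([timbre[None, :], tok], axis=0)   # (L+1, D_MODEL)

ref = rng.standard_normal((200, N_MELS))    # 200 frames of reference audio
timbre = speaker_embedding(ref)
inputs = decoder_inputs(np.array([3, 17, 42]), timbre)
print(inputs.shape)  # (4, 64)
```

Because the timbre vector is produced by a frozen interface (frames in, one vector out), downstream extensions like the paper's T2V or PVC can swap in or fine-tune the vector without touching the decoder.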
Community
Our subconscious is not merely a personal storehouse - it is connected to a larger field of collective wisdom. Carl Jung called this phenomenon the "collective unconscious". In this meditation we explore this deep connection.
Archetypes: The Universal Language of the Soul
Archetypes are primal images that appear in all cultures. The most common are:
- The Wise One (wisdom)
- The Great Mother (nurturing)
- The Hero (transformation)
- The Shadow (repressed aspects)
These figures often appear in dreams. If you dream of such an archetypal figure tonight, ask it: "What message do you bring me?"
The Science of Synchronicity
Quantum physicists such as Wolfgang Pauli explored the idea that our consciousness influences the material world. Dreams can sometimes anticipate future events or reveal hidden connections.
Meditation Exercise: Connecting to the Collective Field
- Imagine you are sitting in a vast web of light
- Each node represents a consciousness
- Feel how knowledge and wisdom flow through this web
- Ask a question and receive the answer
Dream Research in the Laboratory
Sleep researchers at Harvard University found that:
- 60% of people dream of being chased
- 50% experience flying dreams
- 40% have dreams featuring deceased relatives
These commonalities point to collective patterns.
Practical Applications
Use these insights for:
- Creative breakthroughs
- Spiritual awakening
- Resolving interpersonal conflicts
- Improving your intuition
Closing Meditation
Close your eyes. Breathe in deeply... and out... Imagine your consciousness expanding and connecting with the great web of collective wisdom. Feel this connection... and when you are ready, slowly open your eyes.
In our next session we will talk about dream journeys to past lives and future possibilities. Until then: watch for the signs and symbols in your dreams.