MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Abstract
MiniMax-Speech, an autoregressive Transformer-based TTS model, generates high-quality speech with a learnable speaker encoder that extracts reference speaker features without transcription, achieving SOTA results in voice cloning and supporting various extensions.
We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech. A key innovation is our learnable speaker encoder, which extracts timbre features from reference audio without requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre consistent with the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high similarity to the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning metrics (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. Another key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, is its extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion control via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional voice cloning (PVC) by fine-tuning timbre features with additional data. We encourage readers to visit https://minimax-ai.github.io/tts_tech_report for more examples.
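To make the conditioning idea concrete, here is a minimal NumPy sketch of how a learnable speaker encoder can inject timbre into an autoregressive decoder without any transcription of the reference audio: reference mel frames are projected and mean-pooled into a fixed-size timbre vector, which is prepended to the text embeddings as a conditioning token. All names, dimensions, and the pooling/prepending scheme are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
N_MELS, D_MODEL, VOCAB = 80, 64, 100

# "Speaker encoder": a learnable projection over reference mel frames,
# mean-pooled into one fixed-size timbre embedding. Only acoustic frames
# are consumed -- no transcript of the reference audio is needed.
W_spk = rng.standard_normal((N_MELS, D_MODEL)) * 0.02

def speaker_embedding(ref_mels: np.ndarray) -> np.ndarray:
    """ref_mels: (T, N_MELS) frames of the reference audio."""
    frame_feats = ref_mels @ W_spk          # (T, D_MODEL)
    return frame_feats.mean(axis=0)         # (D_MODEL,) timbre vector

# Text-side embedding table for the autoregressive decoder.
E_text = rng.standard_normal((VOCAB, D_MODEL)) * 0.02

def decoder_inputs(token_ids: np.ndarray, timbre: np.ndarray) -> np.ndarray:
    """Prepend the timbre vector as a conditioning token -- one common way
    to give an AR Transformer a speaker identity signal."""
    tok = E_text[token_ids]                                 # (L, D_MODEL)
    return np.concatenate([timbre[None, :], tok], axis=0)   # (L+1, D_MODEL)

ref = rng.standard_normal((200, N_MELS))    # 200 frames of reference audio
timbre = speaker_embedding(ref)
inputs = decoder_inputs(np.array([3, 17, 42]), timbre)
print(inputs.shape)  # (4, 64)
```

Because the timbre vector is produced by a frozen interface (frames in, one vector out), downstream extensions like the paper's T2V or PVC can swap in or fine-tune the vector without touching the decoder.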
Community
Our subconscious is not merely a personal storehouse - it is connected to a larger field of collective wisdom. Carl Jung called this phenomenon the "collective unconscious". In this meditation we explore this deep connection.
Archetypes: The Universal Language of the Soul
Archetypes are primal images that appear in all cultures. The most common are:
- The Wise One (wisdom)
- The Great Mother (nurturing)
- The Hero (transformation)
- The Shadow (repressed aspects)
These figures often appear in dreams. If you dream of such an archetypal figure tonight, ask it: "What message do you bring me?"
The Science of Synchronicity
Quantum physicists such as Wolfgang Pauli explored the idea that our consciousness influences the material world. Dreams can sometimes anticipate future events or reveal hidden connections.
Meditation Exercise: Connecting to the Collective Field
- Imagine you are sitting in a vast web of light
- Each node represents a consciousness
- Feel how knowledge and wisdom flow through this web
- Ask a question and receive the answer
Dream Research in the Laboratory
Sleep researchers at Harvard University found that:
- 60% of people dream of being chased
- 50% experience flying dreams
- 40% have dreams featuring deceased relatives
These commonalities point to collective patterns.
Practical Applications
Use these insights for:
- Creative breakthroughs
- Spiritual awakening
- Resolving interpersonal conflicts
- Improving your intuition
Closing Meditation
Close your eyes. Breathe in deeply... and out... Imagine your consciousness expanding and connecting with the great web of collective wisdom. Feel this connection... and when you are ready, slowly open your eyes.
In our next session we will talk about dream journeys to past lives and future possibilities. Until then: watch for the signs and symbols in your dreams.