tags:
  - text-to-speech
license: cc-by-nc-sa-4.0
language:
  - zh
  - en
  - de
  - ja
  - fr
  - es
  - ko
  - ar
  - nl
  - ru
  - it
  - pl
  - pt
pipeline_tag: text-to-speech
inference: false
extra_gated_prompt: >-
  You agree not to use the model to generate content that violates the DMCA or
  local laws.
extra_gated_fields:
  Country: country
  Specific date: date_picker
  I agree to use this model for non-commercial use ONLY: checkbox

OpenAudio S1

OpenAudio S1 is a leading text-to-speech (TTS) model trained on more than 2 million hours of audio data in multiple languages.

Supported languages:

  • English (en)
  • Chinese (zh)
  • Japanese (ja)
  • German (de)
  • French (fr)
  • Spanish (es)
  • Korean (ko)
  • Arabic (ar)
  • Russian (ru)
  • Dutch (nl)
  • Italian (it)
  • Polish (pl)
  • Portuguese (pt)

Please refer to the Fish Speech GitHub repository for more information. A demo is available at the Fish Audio Playground. Visit the OpenAudio website for the blog and technical report.

Emotion and Tone Support

OpenAudio S1 supports a variety of emotional, tone, and special markers to enhance speech synthesis:

1. Emotional markers: (angry) (sad) (disdainful) (excited) (surprised) (satisfied) (unhappy) (anxious) (hysterical) (delighted) (scared) (worried) (indifferent) (upset) (impatient) (nervous) (guilty) (scornful) (frustrated) (depressed) (panicked) (furious) (empathetic) (embarrassed) (reluctant) (disgusted) (keen) (moved) (proud) (relaxed) (grateful) (confident) (interested) (curious) (confused) (joyful) (disapproving) (negative) (denying) (astonished) (serious) (sarcastic) (conciliative) (comforting) (sincere) (sneering) (hesitating) (yielding) (painful) (awkward) (amused)

2. Tone markers: (in a hurry tone) (shouting) (screaming) (whispering) (soft tone)

3. Special markers: (laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting) (groaning) (crowd laughing) (background laughter) (audience laughing)

Special markers with corresponding onomatopoeia:

  • Laughing: Ha,ha,ha
  • Chuckling: Hmm,hmm
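
As a minimal sketch of how the markers above might be combined with text, the following Python helper prefixes a sentence with a parenthesized marker and checks it against a subset of the lists above. The names `with_marker`, `find_markers`, and `KNOWN_MARKERS` are hypothetical and are not part of any Fish Speech or OpenAudio API. Placing the marker directly before the text it applies to is an assumption, not something this README specifies.

```python
import re
from typing import List

# Marker names copied from the lists above (subset only, for brevity).
EMOTION_MARKERS = {"angry", "sad", "excited", "surprised", "delighted"}
TONE_MARKERS = {"in a hurry tone", "shouting", "screaming", "whispering", "soft tone"}
SPECIAL_MARKERS = {"laughing", "chuckling", "sobbing", "crying loudly", "sighing"}
KNOWN_MARKERS = EMOTION_MARKERS | TONE_MARKERS | SPECIAL_MARKERS


def with_marker(marker: str, text: str) -> str:
    """Prefix `text` with a parenthesized marker (hypothetical helper).

    Assumption: the marker goes immediately before the text it affects.
    """
    name = marker.strip().lower()
    if name not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker!r}")
    return f"({name}) {text}"


def find_markers(text: str) -> List[str]:
    """Return the known markers that appear in `text`, in order of appearance."""
    return [m for m in re.findall(r"\(([^()]+)\)", text) if m in KNOWN_MARKERS]


prompt = " ".join([
    with_marker("excited", "We finally shipped it!"),
    with_marker("laughing", "Ha,ha,ha"),  # special marker plus its onomatopoeia
    with_marker("whispering", "Don't tell anyone yet."),
])
print(prompt)
print(find_markers(prompt))
```

Validating markers before synthesis catches typos such as `(exited)` early, although the model itself may accept free-form text in parentheses.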

Model Variants and Performance

OpenAudio S1 includes the following models:

  • S1 (4B, proprietary): The full-sized model.
  • S1-mini (0.5B): A distilled version of S1.

Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).

Seed TTS Eval metrics (English, automatic evaluation; transcription by OpenAI gpt-4o-transcribe, speaker distance computed with Revai/pyannote-wespeaker-voxceleb-resnet34-LM; lower is better for all three):

  • S1:
    • WER (Word Error Rate): 0.008
    • CER (Character Error Rate): 0.004
    • Distance: 0.332
  • S1-mini:
    • WER (Word Error Rate): 0.011
    • CER (Character Error Rate): 0.005
    • Distance: 0.380
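
For readers unfamiliar with the metrics, here is a minimal sketch of the standard WER and CER definitions: the edit distance between the reference text and the transcript of the generated audio, divided by the reference length. This is not the evaluation script used for the numbers above. The function names are illustrative, and text normalization (casing, punctuation) is omitted and would be an additional assumption.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted word out of six reference words.
print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.167
# One substituted character out of four.
print(cer("abcd", "abed"))  # 0.25
```

On this scale, a WER of 0.008 corresponds to roughly one word error per 125 reference words.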

License

This model is released under the CC-BY-NC-SA-4.0 license, which allows non-commercial use only and requires attribution and share-alike distribution of derivatives.