VieNeu-TTS-1000h

Overview

VieNeu-TTS-1000h is an advanced on-device Vietnamese Text-to-Speech (TTS) model with instant voice cloning.

Trained on ~1000 hours of high-quality Vietnamese speech, this model represents a significant upgrade from VieNeu-TTS-140h with the following improvements:

  • Enhanced pronunciation: More accurate and stable Vietnamese pronunciation
  • Code-switching support: Seamless transitions between Vietnamese and English
  • Better voice cloning: Higher fidelity and speaker consistency
  • Real-time synthesis: 24 kHz waveform generation on CPU or GPU

⚠️ IMPORTANT — EXPERIMENTAL VERSION
VieNeu-TTS-1000h is a Preview (Experimental) build and may contain imperfections in prosody, voice stability, or long-form synthesis.

🔥 The stable official version remains: 👉 VieNeu-TTS (stable): https://huggingface.co/pnnbao-ump/VieNeu-TTS

The 1000h version will replace the stable release after all evaluations and refinements are complete.

Feedback & issues are highly appreciated: pnnbao@gmail.com

Support This Project

Training high-quality TTS models on 1000+ hours of data requires significant GPU resources and compute time. If you find this model useful, please consider supporting the development:

Buy Me a Coffee

Your support helps maintain and improve VieNeu-TTS! 🙏


Voice Cloning Inference

Reference Voice (Speaker Example):

Input Text:

Trên bầu trời xanh thẳm, những đám mây trắng lửng lờ trôi như những chiếc thuyền nhỏ đang lướt nhẹ theo dòng gió. Dưới mặt đất, cánh đồng lúa vàng rực trải dài tới tận chân trời, những bông lúa nghiêng mình theo từng làn gió.

Generated Output (Cloned Voice):
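A minimal sketch of how this example maps onto the Python API (the reference paths, device choice, and output filename here are assumptions for illustration; the full script is shown under Quick Usage below):

from vieneu_tts import VieNeuTTS
import soundfile as sf

# Assumed reference clip and its transcript from the sample/ directory
ref_audio_path = "./sample/id_0001.wav"
with open("./sample/id_0001.txt", encoding="utf-8") as f:
    ref_text = f.read().strip()

input_text = "Trên bầu trời xanh thẳm, ..."  # the passage shown above

tts = VieNeuTTS(
    backbone_repo="pnnbao-ump/VieNeu-TTS-1000h",
    backbone_device="cuda",           # CPU is also supported, just slower
    codec_repo="neuphonic/neucodec",
    codec_device="cuda",
)

ref_codes = tts.encode_reference(ref_audio_path)  # encode the speaker reference once
wav = tts.infer(input_text, ref_codes, ref_text)  # clone the voice onto the new text
sf.write("cloned_output.wav", wav, 24000)         # the model outputs 24 kHz audio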

Long Text Inference

VieNeu-TTS-1000h supports long-form text synthesis (multiple sentences, paragraphs, or entire articles).
For efficient sentence splitting, text normalization, and streaming playback, please refer to the example script in the repository:

🔗 https://github.com/pnnbao97/VieNeu-TTS
Example file: examples/infer_long_text.py
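The repository script is the reference implementation. Purely for illustration, here is a minimal sketch of the underlying idea: regex-based sentence splitting, packing sentences into short chunks, and concatenating the per-chunk waveforms. The splitting rule and the 250-character limit follow the Best Practices section below, and tts, ref_codes, ref_text are assumed to be set up as in the Quick Usage example.

import re
import numpy as np

def split_into_chunks(text, max_chars=250):
    """Split at sentence boundaries and pack sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

def synthesize_long_text(tts, ref_codes, ref_text, long_text):
    # Synthesize each chunk separately, then join the waveforms into one array
    pieces = [tts.infer(chunk, ref_codes, ref_text) for chunk in split_into_chunks(long_text)]
    return np.concatenate(pieces)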

Long-form speech output example:


Model Architecture

| Component      | Description                                  |
|----------------|----------------------------------------------|
| Backbone       | Qwen 0.5B (chat-format LM)                   |
| Codec          | NeuCodec (supports ONNX + quantization)      |
| Output         | 24 kHz waveform synthesis                    |
| Context window | 2048 tokens, shared between text and speech  |
| Watermark      | Enabled                                      |
| Training data  | ~1000 h of Vietnamese + English speech       |

Features

  • High-quality Vietnamese speech synthesis
  • Bilingual support: Vietnamese + English
  • Instant voice cloning (3–5 second reference audio)
  • Fully offline inference
  • Real-time or faster performance
  • Multi-voice reference support
  • Python API + CLI + Gradio interface

Quick Usage (Python)

from vieneu_tts import VieNeuTTS
import soundfile as sf
import os

input_texts = [
    "Các khóa học trực tuyến đang giúp học sinh tiếp cận kiến thức mọi lúc mọi nơi. Giáo viên sử dụng video, bài tập tương tác và thảo luận trực tuyến để nâng cao hiệu quả học tập.",

    "Các nghiên cứu về bệnh Alzheimer cho thấy tác dụng tích cực của các bài tập trí não và chế độ dinh dưỡng lành mạnh, giúp giảm tốc độ suy giảm trí nhớ ở người cao tuổi.",

    "Một tiểu thuyết trinh thám hiện đại dẫn dắt độc giả qua những tình tiết phức tạp, bí ẩn, kết hợp yếu tố tâm lý sâu sắc khiến người đọc luôn hồi hộp theo dõi diễn biến câu chuyện.",

    "Các nhà khoa học nghiên cứu gen người phát hiện những đột biến mới liên quan đến bệnh di truyền. Điều này giúp nâng cao khả năng chẩn đoán và điều trị.",
]

output_dir = "./output_audio"
os.makedirs(output_dir, exist_ok=True)

def main(backbone="pnnbao-ump/VieNeu-TTS-1000h", codec="neuphonic/neucodec"):
    """
    In the sample directory, there are 7 wav files and 7 txt files with matching names.
    These are pre-prepared reference files for testing:
    - id_0001.wav + id_0001.txt
    - id_0002.wav + id_0002.txt
    - id_0003.wav + id_0003.txt
    - id_0004.wav + id_0004.txt
    - id_0005.wav + id_0005.txt
    - id_0006.wav + id_0006.txt
    - id_0007.wav + id_0007.txt
    
    Odd numbers = Male voices
    Even numbers = Female voices
    
    Note: The model can clone any voice you provide (with corresponding text).
    However, quality may not match the sample files. For best results, finetune
    the model on your target voice. See finetune guide at:
    https://github.com/pnnbao-ump/VieNeuTTS/blob/main/finetune.ipynb
    """
    # Male voice (South accent)
    ref_audio_path = "./sample/id_0001.wav"
    ref_text_path = "./sample/id_0001.txt"
    
    # Female voice (South accent) - uncomment to use
    # ref_audio_path = "./sample/id_0002.wav"
    # ref_text_path = "./sample/id_0002.txt"

    if not os.path.exists(ref_audio_path) or not os.path.exists(ref_text_path):
        print("Reference audio or transcript file not found.")
        return None

    # Read the transcript of the reference clip
    with open(ref_text_path, "r", encoding="utf-8") as f:
        ref_text_raw = f.read().strip()

    # Initialize VieNeuTTS-1000h
    tts = VieNeuTTS(
        backbone_repo=backbone,
        backbone_device="cuda",
        codec_repo=codec,
        codec_device="cuda"
    )

    print("Encoding reference audio...")
    ref_codes = tts.encode_reference(ref_audio_path)

    # Generate speech for all input texts
    for i, text in enumerate(input_texts, 1):
        print(f"Generating audio {i}/{len(input_texts)}: {text[:50]}...")
        wav = tts.infer(text, ref_codes, ref_text_raw)
        output_path = os.path.join(output_dir, f"output_{i}.wav")
        sf.write(output_path, wav, 24000)
        print(f"✓ Saved to {output_path}")

if __name__ == "__main__":
    main()
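If no GPU is available (or you hit the GPU OOM issue listed under Troubleshooting), the same script should work on CPU. A minimal sketch, assuming the device arguments accept "cpu":

# CPU-only initialization (slower than GPU, but avoids VRAM limits)
tts = VieNeuTTS(
    backbone_repo="pnnbao-ump/VieNeu-TTS-1000h",
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",
    codec_device="cpu",
)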

Gradio Demo

uv run gradio_app.py

Open your browser at http://127.0.0.1:7860.

Reference Voices

The sample/ directory contains pre-recorded reference voices:

| File    | Gender | Accent | Description           |
|---------|--------|--------|-----------------------|
| id_0001 | Male   | South  | Clear, neutral tone   |
| id_0002 | Female | South  | Natural, expressive   |
| id_0003 | Male   | South  | Professional delivery |
| id_0004 | Female | South  | Warm, conversational  |
| id_0005 | Male   | South  | Energetic style       |
| id_0007 | Male   | South  | Calm, measured        |

Pattern: Odd numbers = Male | Even numbers = Female

Best Practices

  • Text length: Keep input ≤ 250 characters per inference call for optimal quality
  • Long text: For longer content, use examples/infer_long_text.py for proper sentence splitting
  • Reference audio: Use clean, 3–5 second clips with clear speech
  • Custom voices: Finetune the model for best results with your target voice
  • Normalization: The model handles Vietnamese text normalization automatically

Improvements Over VieNeu-TTS-140h

| Feature                | VieNeu-TTS-140h | VieNeu-TTS-1000h     |
|------------------------|-----------------|----------------------|
| Training data          | 140 hours       | ~1000 hours          |
| Languages              | Vietnamese only | Vietnamese + English |
| Pronunciation accuracy | Good            | Excellent            |
| Voice cloning fidelity | Good            | Enhanced             |
| Code-switching         | Limited         | Native support       |

Troubleshooting

| Issue             | Cause                 | Solution                                      |
|-------------------|-----------------------|-----------------------------------------------|
| Missing libespeak | System dependency     | Install eSpeak NG                             |
| GPU OOM           | VRAM too small        | Use CPU mode or a smaller batch size          |
| Poor voice match  | Low-quality reference | Use a clear 3–5 s clip; consider finetuning   |
| Import errors     | Package not installed | pip install vieneu-tts                        |

Installation

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
uv sync

License

Apache 2.0

Citation

@misc{vieneutts1000h2025,
  title        = {VieNeu-TTS-1000h: Vietnamese-English Bilingual Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS-1000h}}
}

Please also cite the base model:

@misc{neuttsair2025,
  title        = {NeuTTS Air: On-Device Speech Language Model with Instant Voice Cloning},
  author       = {Neuphonic},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}}
}