Note: This repository is a work in progress. The README will be updated with more details on the QCS6490 porting process, model architecture, and calibration data generation workflow in the coming weeks. Please check back for updates!

Supertonic-2 QNN Inference

High-quality Text-to-Speech synthesis using Supertonic-2 QNN models.

Quick Start

# Basic usage
python supertonic_inference.py --text "Hello world" --voice M1 -o output.wav

# With specific seed for reproducibility
python supertonic_inference.py \
  --text "Your text here" \
  --voice F1 \
  --seed 42 \
  --output output.wav

Features

  • βœ… High Quality: Matches official Supertonic library behavior
  • βœ… Reproducible: Use --seed for consistent outputs
  • βœ… Multilingual: Supports 5 languages (EN, KO, ES, PT, FR)
  • βœ… Multiple Voices: 10 voice styles (5 female, 5 male)
  • βœ… Customizable: Adjust quality, speed, and more

QCS6490 Deployment

This repository includes complete resources for porting to Qualcomm QCS6490 (Hexagon HTP V68):

Generate Calibration Data

# Generate 10 calibration samples for quantization and accuracy validation
python generate_calibration_data.py

# Validate QNN model accuracy against ONNX reference
python validate_qnn_accuracy.py \
  --calibration-dir calibration_data \
  --qnn-dir qnn_outputs \
  --report accuracy_report.json

Expected Performance on QCS6490:

  • 100-150Γ— real-time inference speed
  • ~75ms latency for 2-second audio (5 diffusion steps)
  • ~70-100 MB total model size (after INT8/FP16 quantization)

Model Files

Required directory structure:

model/
β”œβ”€β”€ onnx/
β”‚   β”œβ”€β”€ text_encoder.onnx
β”‚   β”œβ”€β”€ duration_predictor.onnx
β”‚   β”œβ”€β”€ vector_estimator.onnx
β”‚   β”œβ”€β”€ vocoder.onnx
β”‚   β”œβ”€β”€ tts.json
β”‚   └── unicode_indexer.json
└── voice_styles/
    β”œβ”€β”€ F1.json ... F5.json (Female voices)
    └── M1.json ... M5.json (Male voices)

Command Line Options

Option Description Default
--text, -t Text to synthesize Required
--voice, -v Voice style (F1-F5, M1-M5) M1
--lang, -l Language (en, ko, es, pt, fr) en
--output, -o Output WAV file path output/output.wav
--steps Diffusion steps (more = better) 10
--speed Speech speed multiplier 1.0
--seed Random seed (for reproducibility) None
--quiet, -q Suppress progress messages False

Examples

Basic Synthesis

python supertonic_inference.py \
  --text "The weather is nice today." \
  --voice F1 \
  --output weather.wav

High Quality with Reproducible Output

python supertonic_inference.py \
  --text "Important announcement." \
  --voice M1 \
  --steps 20 \
  --seed 42 \
  --output announcement.wav

Faster Speech

python supertonic_inference.py \
  --text "Quick update message." \
  --voice F2 \
  --speed 1.2 \
  --output quick.wav

Spanish Language

python supertonic_inference.py \
  --text "Hola mundo" \
  --lang es \
  --voice M3 \
  --output spanish.wav

Voice Styles

Code Type Description
F1-F5 Female 5 distinct female voices
M1-M5 Male 5 distinct male voices

Supported Languages

  • en - English
  • ko - Korean
  • es - Spanish
  • pt - Portuguese
  • fr - French

Python API

from supertonic_inference import SupertonicTTS, save_wav

# Initialize
tts = SupertonicTTS(model_dir="model/onnx")

# Synthesize
waveform, duration = tts.synthesize(
    text="Hello world",
    voice_name="M1",
    lang="en",
    diffusion_steps=10,
    speed=1.0,
    seed=42
)

# Save
save_wav("output.wav", waveform, tts.sample_rate)

Performance

  • Speed: ~25Γ— faster than real-time on CPU
  • Quality: Matches official Supertonic library
  • Sample Rate: 44.1 kHz
  • Variance: ~5% duration variance due to diffusion randomness

Parameters Guide

Diffusion Steps

  • 5-10 steps: Fast, good quality (default: 10)
  • 15-20 steps: Higher quality, slower
  • Trade-off: Each step improves quality but increases compute time

Speed

  • 0.8-1.0: Slower, more natural
  • 1.0: Normal speed (default)
  • 1.1-1.5: Faster, still intelligible

Seed

  • None: Different output each run (random)
  • Integer: Reproducible output with same seed
  • Use case: Testing, demos, consistent results

Project Structure

supertonic2-qualcomm/
β”œβ”€β”€ supertonic_inference.py    # Main inference script
β”œβ”€β”€ README.md                   # This file
β”œβ”€β”€ model/
β”‚   β”œβ”€β”€ onnx/                   # ONNX models
β”‚   └── voice_styles/           # Voice embeddings
β”œβ”€β”€ inputs/                     # Test inputs
└── output/                     # Generated audio

Notes

  • Text is automatically preprocessed (normalization, punctuation)
  • Periods are added automatically if missing
  • Unicode characters are handled transparently
  • Emoji are automatically removed

License

This implementation uses models from Supertone/supertonic-2.

Reference

Official repository: https://github.com/supertone-inc/supertonic

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for dev-ansh-r/supertonic2-qualcomm-quantized

Quantized
(2)
this model