---
title: BatonVoice
emoji: 🏆
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.47.2
app_file: app.py
pinned: false
short_description: A Framework for Controllable Speech Synthesis
---
# BatonVoice: An Operationalist Framework for Controllable Speech Synthesis
[📄 Paper](https://arxiv.org/pdf/2509.26514)
[💻 Code](https://github.com/Tencent/digitalhuman/tree/main/BatonVoice)
[🤗 Model](https://huggingface.co/Yue-Wang/BatonTTS-1.7B)
This is the official implementation of the paper: **BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs**.
## 🎵 Abstract
We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation.
We introduce **BatonVoice**, a framework where a Large Language Model (LLM) acts as a **"conductor"**. The conductor's role is to understand nuanced user instructions and generate a detailed, textual **"plan"**. This plan consists of explicit, word-level vocal features (e.g., pitch, energy, speaking rate).
A separate, specialized TTS model, the **"orchestra"**, then executes this plan, generating the final speech directly from these precise features. To realize this component, we developed **BatonTTS**, a 1.7B-parameter TTS model trained specifically for this task, which uses Qwen3-1.7B as its backbone and the speech tokenizer of CosyVoice2.
## 🏛️ Framework Overview
The BatonVoice framework operates in a simple yet powerful sequence:
`[User Instruction] ➡️ [LLM (Conductor)] ➡️ [Textual Plan (Features)] ➡️ [BatonTTS (Orchestra)] ➡️ [Speech Output]`
This decoupling allows for unprecedented control and expressiveness, as the complex task of interpretation is handled by a powerful LLM, while the TTS model focuses solely on high-fidelity audio generation based on explicit guidance.
## 🎧 Audio Examples
Here are some examples of speech generated by BatonVoice:
### Samples
- **Example 1**: [Listen to example_1.wav](./asserts/example_1.wav)
- **Example 2**: [Listen to example_2.wav](./asserts/example_2.wav)
- **Example 3**: [Listen to example_3.wav](./asserts/example_3.wav)
- **Example 4**: [Listen to example_4.wav](./asserts/example_4.wav)
## 🚀 Getting Started
### Core Principle: Word-Level Feature Control
The core of our framework is the ability to control the synthesized speech through word-level acoustic features. This means you can fine-tune the output by adjusting the specific numerical values for each word or segment.
### Recommended Workflow
For the best results, we highly recommend using a powerful, instruction-following LLM to generate the initial feature plan. This significantly reduces the manual effort required.
1. **Generate a Feature Template with an LLM**: Use a powerful LLM like **Gemini 1.5 Pro** to generate a feature plan based on your text and a descriptive prompt (e.g., "in a happy and excited tone").
* For detailed examples of how to structure these prompts, please refer to our client implementations: `openrouter_gemini_client.py` and `gradio_tts_interface.py`.
2. **(Optional) Manually Fine-Tune the Features**: Review the LLM-generated features. You can manually adjust the values for specific words or phrases to achieve the perfect delivery. This is where the true power of BatonVoice lies.
3. **Synthesize Speech with BatonTTS**: Feed the final feature plan into the BatonTTS model to generate the audio; a minimal end-to-end sketch of this workflow follows below.
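The sketch below pulls these steps together using the `UnifiedTTS` API documented in the API Reference. The feature plan is hard-coded here as a stand-in for the output of your LLM client (see `openrouter_gemini_client.py` for a real integration), and the model and file paths simply mirror the ones used elsewhere in this README:
```python
import json
from unified_tts import UnifiedTTS

# Step 1 (normally done by the LLM "conductor"): a word-level feature plan.
# Hard-coded here; in practice it would come from an LLM client and could be
# hand-tuned afterwards (Step 2). Values are illustrative only.
feature_plan = [
    {"word": "Hello world,", "pitch_mean": 300, "pitch_slope": 60,
     "energy_rms": 0.012, "energy_slope": 20, "spectral_centroid": 2500},
    {"word": "how are you today?", "pitch_mean": 280, "pitch_slope": -40,
     "energy_rms": 0.010, "energy_slope": -15, "spectral_centroid": 2300},
]

# Step 3: the BatonTTS "orchestra" executes the plan.
tts = UnifiedTTS(
    model_path='Yue-Wang/BatonTTS-1.7B',
    cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B',
    prompt_audio_path='./prompt.wav'
)
tts.text_features_to_speech("Hello world, how are you today?",
                            json.dumps(feature_plan), "hello_planned.wav")
```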
### Alternative Method (Less Recommended)
You can also use BatonTTS in a text-only mode, where the model generates both the features and the speech on its own. However, because BatonTTS is a relatively small model, the features it generates without guidance often lack variation, resulting in a monotonous voice. We strongly suggest using the LLM-driven workflow for expressive results.
## ⚙️ Understanding the Features
You can control the speech output by adjusting the following features in the plan.
| Feature | Description |
| ------------------- | ------------------------------------------------------------ |
| `pitch_mean` | The mean fundamental frequency (F0) of the voice for the segment. Higher values mean a higher-pitched voice. |
| `pitch_slope` | The rate of change of pitch within the segment. Positive values indicate a rising intonation. |
| `energy_rms` | The root mean square energy, corresponding to the loudness or volume of the segment. |
| `energy_slope` | The rate of change of energy. Can be used to create crescendo or decrescendo effects. |
| `spectral_centroid` | Relates to the "brightness" of the sound. Higher values often sound clearer or sharper. |
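For quick reference, each entry in a feature plan is a Python dictionary carrying these fields plus the `word` text it applies to. The values below are illustrative only; their rough magnitudes follow the example plans later in this README:
```python
# One segment of a feature plan. Values are illustrative; rough magnitudes are
# taken from the examples below and are not hard limits of the model.
segment = {
    "word": "did a great job.",  # the words covered by this roughly one-second segment
    "pitch_mean": 330,           # mean F0 of the segment; higher = higher-pitched
    "pitch_slope": -80,          # F0 change across the segment; negative = falling intonation
    "energy_rms": 0.014,         # RMS energy (loudness) of the segment
    "energy_slope": -50,         # energy change; negative = decrescendo
    "spectral_centroid": 2400,   # spectral "brightness"; higher often sounds sharper
}
```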
### A Special Feature: Word Segmentation
The `word` field and the structure of the feature list itself provide powerful control over the rhythm and pacing of the speech.
> **Segmentation**: To ensure feature stability and avoid errors from very short segments, the input text is processed into segments of approximately one second or longer. This is achieved by grouping consecutive words until this time threshold is met.
This has two important implications:
1. **Speaking Rate**: The number of words in a segment's `word` field implicitly sets the local speaking rate. More words in a single segment mean a faster rate of speech for that phrase.
2. **Pauses**: The boundaries between dictionaries in the list suggest potential pause locations in the synthesized speech. You can create a pause by splitting a sentence into more segments, as illustrated in the sketch below.
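For example, the same sentence can be planned with coarser or finer segmentation. The feature values in this sketch are chosen purely for illustration; only the segmentation differs in a meaningful way:
```python
# Coarser segmentation: two segments, fewer pause points, brisker local pacing.
plan_brisk = [
    {"word": "Wow, you really", "pitch_mean": 320, "pitch_slope": 40,
     "energy_rms": 0.012, "energy_slope": 10, "spectral_centroid": 2400},
    {"word": "did a great job.", "pitch_mean": 300, "pitch_slope": -30,
     "energy_rms": 0.011, "energy_slope": -10, "spectral_centroid": 2300},
]

# Finer segmentation: more boundaries, so more potential pauses and a slower,
# more deliberate delivery of the same sentence.
plan_deliberate = [
    {"word": "Wow,", "pitch_mean": 340, "pitch_slope": 20,
     "energy_rms": 0.012, "energy_slope": 5, "spectral_centroid": 2450},
    {"word": "you really", "pitch_mean": 310, "pitch_slope": 30,
     "energy_rms": 0.011, "energy_slope": 5, "spectral_centroid": 2350},
    {"word": "did a great job.", "pitch_mean": 290, "pitch_slope": -40,
     "energy_rms": 0.010, "energy_slope": -15, "spectral_centroid": 2250},
]
```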
## ✨ Examples
Let's see how Gemini 2.5 Pro generates features for the sentence **"Wow, you really did a great job."** under different emotional instructions.
### Example 1: Happy Tone
```python
# Prompt: "Please speak in a happy tone."
text = "Wow, you really did a great job."
feature_plan_happy = [
    {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95,
     "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
    {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80,
     "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
]
```
🎵 **Audio Output**: [Listen to happy.wav](./asserts/happy.wav)
### Example 2: Sarcastic Tone
```python
# Prompt: "Please speak in a sarcastic tone."
text = "Wow, you really did a great job."
feature_plan_sarcastic = [
    {"word": "wow", "pitch_mean": 271, "pitch_slope": 6,
     "energy_rms": 0.009, "energy_slope": -4, "spectral_centroid": 2144},
    {"word": "you really", "pitch_mean": 270, "pitch_slope": 195,
     "energy_rms": 0.01, "energy_slope": 8, "spectral_centroid": 1403},
    {"word": "did a great", "pitch_mean": 287, "pitch_slope": 152,
     "energy_rms": 0.009, "energy_slope": -15, "spectral_centroid": 1920},
    {"word": "job", "pitch_mean": 166, "pitch_slope": -20,
     "energy_rms": 0.004, "energy_slope": -66, "spectral_centroid": 1881},
]
```
🎵 **Audio Output**: [Listen to sarcastic.wav](./asserts/sarcastic.wav)
## Features
### Core Functionality
- **Unified TTS Interface**: Single interface supporting multiple TTS modes
- **Emotion-Controlled Speech**: Generate speech with specific emotional characteristics
- **Prosodic Feature Control**: Fine-tune pitch, energy, and spectral features
- **Audio Feature Extraction**: Extract word-level features from existing audio files
- **Web-based Interface**: User-friendly Gradio interface for easy interaction
### Four Main Modes
1. **Mode 1: Text + Features to Audio**
- Input: Text and predefined prosodic features
- Output: High-quality audio with controlled characteristics
- Use case: Precise control over speech prosody
2. **Mode 2: Text to Features + Audio**
- Input: Text only
- Output: Generated features and corresponding audio
- Use case: Automatic feature generation with natural speech
3. **Mode 3: Audio to Text Features**
- Input: Audio file
- Output: Extracted text and prosodic features
- Use case: Analysis and feature extraction from existing audio
4. **Mode 4: Text + Instruction to Features**
- Input: Text and emotional/stylistic instructions
- Output: AI-generated prosodic features
- Use case: Emotion-driven feature generation using AI
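Modes 1-3 are exercised in the Quick Start and API Reference sections below. Mode 4 delegates feature generation to an external LLM; the repository's own client is `openrouter_gemini_client.py`. As an illustration of the idea only (not that client's actual API), here is a hypothetical sketch using the `openai` package pointed at OpenRouter; the prompt wording and model id are assumptions:
```python
import json
from openai import OpenAI

# Hypothetical Mode 4 sketch: ask an LLM "conductor" for a word-level feature plan.
# See openrouter_gemini_client.py for the implementation shipped with this repo.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

text = "Wow, you really did a great job."
instruction = "Please speak in a happy tone."

prompt = (
    "Split the sentence into segments of roughly one second and return only a JSON "
    "list with one object per segment, using the keys: word, pitch_mean, pitch_slope, "
    "energy_rms, energy_slope, spectral_centroid.\n"
    f"Instruction: {instruction}\nSentence: {text}"
)

response = client.chat.completions.create(
    model="google/gemini-2.5-pro",  # any strong instruction-following LLM should work
    messages=[{"role": "user", "content": prompt}],
)

# The reply may be wrapped in Markdown fences; strip them before parsing if needed.
feature_plan = json.loads(response.choices[0].message.content)
```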
## Installation
### Prerequisites
- Python 3.10
- CUDA-compatible GPU (recommended)
- Git with submodule support
### Step-by-Step Installation
1. **Clone the repository with submodules**:
```bash
git clone --recursive https://github.com/Tencent/digitalhuman.git
cd digitalhuman/BatonVoice
```
2. **Update submodules**:
```bash
git submodule update --init --recursive
```
3. **Create and activate Conda environment**:
```bash
conda create -n batonvoice -y python=3.10
conda activate batonvoice
```
4. **Install Python dependencies**:
```bash
pip install -r requirements.txt
```
5. **Download the CosyVoice2 model**:
```python
from modelscope import snapshot_download
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
```
## Quick Start
### Command Line Usage
#### Basic Text-to-Speech
```bash
python unified_tts.py --text "Hello world, how are you today?" --output output.wav
```
#### Text with Custom Features
```bash
python unified_tts.py --text "Hello world" --features '[{"word": "Hello world", "pitch_mean": 280, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]' --output output.wav
```
#### Audio Feature Extraction
```bash
python audio_feature_extractor.py --audio prompt.wav --output features.json
```
### Web Interface
Launch the Gradio web interface for interactive use:
```bash
python gradio_tts_interface.py
```
Then open the provided URL in your browser to access the web interface.
## Project Structure
```
batonvoice/
├── unified_tts.py # Main TTS engine with unified interface
├── gradio_tts_interface.py # Web-based user interface
├── audio_feature_extractor.py # Audio analysis and feature extraction
├── openrouter_gemini_client.py # AI-powered feature generation
├── requirements.txt # Python dependencies
├── prompt.wav # Default prompt audio file
├── third-party/ # External dependencies
│ ├── CosyVoice/ # CosyVoice2 TTS model
│ └── Matcha-TTS/ # Matcha-TTS model
└── pretrained_models/ # Downloaded model files
└── CosyVoice2-0.5B/ # CosyVoice2 model directory
```
## API Reference
### UnifiedTTS Class
```python
from unified_tts import UnifiedTTS
# Initialize the TTS engine
tts = UnifiedTTS(
    model_path='Yue-Wang/BatonTTS-1.7B',
    cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B',
    prompt_audio_path='./prompt.wav'
)

# Mode 2: text only -> features + speech
tts.text_to_speech("Hello world", "output1.wav")

# Mode 1: text + features -> speech (features is a JSON string)
features = '[{"word": "Hello world", "pitch_mean": 300, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]'
tts.text_features_to_speech("Hello world", features, "output2.wav")
```
### AudioFeatureExtractor Class
```python
from audio_feature_extractor import AudioFeatureExtractor
# Initialize extractor
extractor = AudioFeatureExtractor()
# Extract features from audio
features = extractor.extract_features("input.wav")
print(features)
```
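Continuing the snippet above, the extracted features can be saved for later editing or reuse as a synthesis plan, mirroring the `--output features.json` option of the command-line script. This assumes `extract_features` returns plain Python lists/dicts; adjust accordingly if it already returns a JSON string:
```python
import json

# Persist the extracted word-level features so they can be reviewed, hand-tuned,
# or fed back into BatonTTS as a feature plan.
with open("features.json", "w", encoding="utf-8") as f:
    json.dump(features, f, indent=2, ensure_ascii=False)
```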
## Acknowledgments
- [Qwen3](https://github.com/QwenLM/Qwen3): Powerful LLM Backbone
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice): Advanced TTS model from FunAudioLLM
- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS): High-quality TTS architecture
- [Whisper](https://github.com/openai/whisper): Speech recognition capabilities
- [Wav2Vec2](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec): Word-level alignment features
---
**Note**: For research purposes only. Do not use for commercial or production purposes.
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference