	| title: BatonVoice | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: indigo | |
| sdk: gradio | |
| sdk_version: 5.47.2 | |
| app_file: app.py | |
| pinned: false | |
| short_description: A Framework for Controllable Speech Synthesis | |
| # <img src="./asserts/logo.png" alt="BatonVoice Logo" height="40" style="vertical-align: middle;"> BatonVoice: An Operationalist Framework for Controllable Speech Synthesis | |
| [Paper (arXiv)](https://arxiv.org/pdf/2509.26514) | |
| [Code (GitHub)](https://github.com/Tencent/digitalhuman/tree/main/BatonVoice) | |
| [Model (Hugging Face)](https://huggingface.co/Yue-Wang/BatonTTS-1.7B) | |
| This is the official implementation of the paper: **BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs**. | |
| <img src="./asserts/infer.png" width="500px"> | |
| ## 🎵 Abstract | |
| We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. | |
| We introduce **BatonVoice**, a framework where a Large Language Model (LLM) acts as a **"conductor"**. The conductor's role is to understand nuanced user instructions and generate a detailed, textual **"plan"**. This plan consists of explicit, word-level vocal features (e.g., pitch, energy, speaking rate). | |
| A separate, specialized TTS model, the **"orchestra"**, then executes this plan, generating the final speech directly from these precise features. To realize this component, we developed **BatonTTS**, a 1.7B-parameter TTS model trained specifically for this task, which uses Qwen3-1.7B as its backbone and the speech tokenizer from CosyVoice2. | |
| <img src="./asserts/framework.png" width="500px"> | |
| ## Framework Overview | |
| The BatonVoice framework operates in a simple yet powerful sequence: | |
| `[User Instruction] ➡️ [LLM (Conductor)] ➡️ [Textual Plan (Features)] ➡️ [BatonTTS (Orchestra)] ➡️ [Speech Output]` | |
| This decoupling improves control and expressiveness: a capable LLM handles the complex task of interpreting the instruction, while the TTS model focuses solely on high-fidelity audio generation from explicit guidance. | |
| ## 🎧 Audio Examples | |
| Here are some examples of speech generated by BatonVoice: | |
| ### Demo Video | |
| <video src="https://github.com/user-attachments/assets/238735ce-3fa8-4b6d-b05b-dfba903b0187" type="video/mp4" width="80%" controls> | |
| </video> | |
| ### More Samples | |
| - **Example 1**: [Listen to example_1.wav](./asserts/example_1.wav) | |
| - **Example 2**: [Listen to example_2.wav](./asserts/example_2.wav) | |
| - **Example 3**: [Listen to example_3.wav](./asserts/example_3.wav) | |
| - **Example 4**: [Listen to example_4.wav](./asserts/example_4.wav) | |
| ## Getting Started | |
| ### Core Principle: Word-Level Feature Control | |
| The core of our framework is the ability to control the synthesized speech through word-level acoustic features. This means you can fine-tune the output by adjusting the specific numerical values for each word or segment. | |
| ### Recommended Workflow | |
| For the best results, we highly recommend using a powerful, instruction-following LLM to generate the initial feature plan. This significantly reduces the manual effort required. | |
| 1. **Generate a Feature Template with an LLM**: Use a powerful LLM like **Gemini 1.5 Pro** to generate a feature plan based on your text and a descriptive prompt (e.g., "in a happy and excited tone"). | |
| * For detailed examples of how to structure these prompts, please refer to our client implementations: `openrouter_gemini_client.py` and `gradio_tts_interface.py`. | |
| 2. **(Optional) Manually Fine-Tune the Features**: Review the LLM-generated features. You can manually adjust the values for specific words or phrases to achieve the perfect delivery. This is where the true power of BatonVoice lies. | |
| 3. **Synthesize Speech with BatonTTS**: Feed the final feature plan into the BatonTTS model to generate the audio. | |
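Step 2 of this workflow can be sketched in a few lines of Python. The snippet below is a minimal illustration, not part of the BatonVoice codebase: `adjust_segment` is a hypothetical helper, and the plan values are copied from the happy-tone example in this README. It overrides selected features in one segment and serializes the result to the JSON string form used for the `--features` argument.

```python
import copy
import json

# Feature plan for "Wow, you really did a great job."
# (values copied from the happy-tone example in this README).
plan = [
    {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95,
     "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
    {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80,
     "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
]


def adjust_segment(plan, index, **overrides):
    """Return a copy of `plan` with feature values replaced in one segment.

    Hypothetical helper for illustration; not part of BatonVoice.
    """
    unknown = set(overrides) - set(plan[index])
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    edited = copy.deepcopy(plan)  # leave the LLM-generated plan untouched
    edited[index].update(overrides)
    return edited


# Lower the pitch and soften the falling intonation of the second segment.
tuned = adjust_segment(plan, 1, pitch_mean=300, pitch_slope=-40)

# JSON string that can be passed on as the feature plan.
features_json = json.dumps(tuned)
print(features_json)
```

Working on a copy keeps the original LLM output available, which makes it easy to compare several hand-tuned variants of the same plan.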
| ### Alternative Method (Less Recommended) | |
| You can also use BatonTTS in a text-only mode to generate both the features and the speech. However, due to the limitations of a smaller model, the generated features often lack variation, resulting in a monotonous voice. We strongly suggest using the LLM-driven workflow for expressive results. | |
| ## Understanding the Features | |
| You can control the speech output by adjusting the following features in the plan. | |
| | Feature | Description | | |
| | ------------------- | ------------------------------------------------------------ | | |
| | `pitch_mean` | The mean fundamental frequency (F0) of the voice for the segment. Higher values mean a higher-pitched voice. | |
| | `pitch_slope` | The rate of change of pitch within the segment. Positive values indicate a rising intonation. | | |
| | `energy_rms` | The root mean square energy, corresponding to the loudness or volume of the segment. | | |
| | `energy_slope` | The rate of change of energy. Can be used to create crescendo or decrescendo effects. | | |
| | `spectral_centroid` | Relates to the "brightness" of the sound. Higher values often sound clearer or sharper. | | |
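Before passing a hand-edited plan to the model, it can help to check that every segment carries all of the fields above. The following is a minimal sketch under stated assumptions: `validate_plan` is a hypothetical helper (not part of BatonVoice), the field names follow the feature plans shown in this README's examples, and the check covers structure only, not whether the values are acoustically sensible.

```python
from numbers import Real

# Field names as used in the example feature plans in this README.
NUMERIC_FIELDS = (
    "pitch_mean",
    "pitch_slope",
    "energy_rms",
    "energy_slope",
    "spectral_centroid",
)


def validate_plan(plan):
    """Return a list of problems; an empty list means the plan is well-formed.

    Hypothetical helper for illustration; not part of BatonVoice.
    """
    problems = []
    for i, seg in enumerate(plan):
        word = seg.get("word")
        if not isinstance(word, str) or not word.strip():
            problems.append(f"segment {i}: 'word' must be a non-empty string")
        for name in NUMERIC_FIELDS:
            value = seg.get(name)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, Real):
                problems.append(f"segment {i}: '{name}' must be a number")
    return problems


good = [{"word": "Hello world", "pitch_mean": 280, "pitch_slope": 50,
         "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]
bad = [{"word": "", "pitch_mean": "high", "pitch_slope": 50,
        "energy_rms": 0.006, "energy_slope": 15}]

print(validate_plan(good))
print(validate_plan(bad))
```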
| ### A Special Feature: Word Segmentation | |
| The `word` field and the structure of the feature list itself provide powerful control over the rhythm and pacing of the speech. | |
| > **Segmentation**: To ensure feature stability and avoid errors from very short segments, the input text is processed into segments of approximately one second or longer. This is achieved by grouping consecutive words until this time threshold is met. | |
| This has two important implications: | |
| 1. **Speaking Rate**: The number of words in a segment's `'word'` field implicitly indicates the local speaking rate. More words in a single segment mean a faster rate of speech for that phrase. | |
| 2. **Pauses**: The boundaries between dictionaries in the list can suggest potential pause locations in the synthesized speech. You can create a pause by splitting a sentence into more segments. | |
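Both implications can be explored directly on a plan. The sketch below is illustrative only: `words_per_segment` and `split_segment` are hypothetical helpers, not part of BatonVoice. Copying the parent segment's features to both halves is an assumption made to keep the example short; in practice you would likely adjust the values afterwards, and very short segments may be less stable given the one-second guideline above.

```python
def words_per_segment(plan):
    """Word count per segment, a rough proxy for the local speaking rate."""
    return [len(seg["word"].split()) for seg in plan]


def split_segment(plan, index, n_words):
    """Split segment `index` after its first `n_words` words.

    Both halves inherit the parent's feature values (an assumption made
    for illustration). Returns a new list; the input is not modified.
    """
    seg = plan[index]
    words = seg["word"].split()
    if not 0 < n_words < len(words):
        raise ValueError("n_words must leave at least one word on each side")
    first = {**seg, "word": " ".join(words[:n_words])}
    second = {**seg, "word": " ".join(words[n_words:])}
    return plan[:index] + [first, second] + plan[index + 1:]


plan = [
    {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95,
     "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
    {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80,
     "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
]

print(words_per_segment(plan))

# Add a segment boundary before "job." to suggest a pause there.
with_pause = split_segment(plan, 1, 3)
print([seg["word"] for seg in with_pause])
```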
| ## ✨ Examples | |
| Let's see how to generate features for the sentence: **"Wow, you really did a great job."** using Gemini 2.5 Pro with different emotional instructions. | |
| ### Example 1: Happy Tone | |
| ```python | |
| # Prompt: "Please speak in a happy tone." | |
| text = "Wow, you really did a great job." | |
| feature_plan_happy = [{"word": "Wow, you really","pitch_mean": 360,"pitch_slope": 95,"energy_rms": 0.016,"energy_slope": 60,"spectral_centroid": 2650},{"word": "did a great job.","pitch_mean": 330,"pitch_slope": -80,"energy_rms": 0.014,"energy_slope": -50,"spectral_centroid": 2400}] | |
| ``` | |
| 🎵 **Audio Output**: [Listen to happy.wav](./asserts/happy.wav) | |
| ### Example 2: Sarcastic Tone | |
| ```python | |
| # Prompt: "Please speak in a sarcastic tone." | |
| text = "Wow, you really did a great job." | |
| feature_plan_sarcastic = [{"word": "wow", "pitch_mean": 271, "pitch_slope": 6, "energy_rms": 0.009, "energy_slope": -4, "spectral_centroid": 2144}, {"word": "you really", "pitch_mean": 270, "pitch_slope": 195, "energy_rms": 0.01, "energy_slope": 8, "spectral_centroid": 1403}, {"word": "did a great", "pitch_mean": 287, "pitch_slope": 152, "energy_rms": 0.009, "energy_slope": -15, "spectral_centroid": 1920}, {"word": "job", "pitch_mean": 166, "pitch_slope": -20, "energy_rms": 0.004, "energy_slope": -66, "spectral_centroid": 1881}] | |
| ``` | |
| 🎵 **Audio Output**: [Listen to sarcastic.wav](./asserts/sarcastic.wav) | |
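To see how the two plans differ at a glance, a few summary statistics are enough. The snippet below is a small sketch, not part of the BatonVoice tooling: `summarize` is a hypothetical helper, and only the `pitch_mean` and `energy_rms` values from the two examples above are copied in.

```python
from statistics import mean

# pitch_mean and energy_rms copied from the two feature plans above;
# the remaining fields are omitted to keep the sketch short.
happy = [
    {"word": "Wow, you really", "pitch_mean": 360, "energy_rms": 0.016},
    {"word": "did a great job.", "pitch_mean": 330, "energy_rms": 0.014},
]
sarcastic = [
    {"word": "wow", "pitch_mean": 271, "energy_rms": 0.009},
    {"word": "you really", "pitch_mean": 270, "energy_rms": 0.01},
    {"word": "did a great", "pitch_mean": 287, "energy_rms": 0.009},
    {"word": "job", "pitch_mean": 166, "energy_rms": 0.004},
]


def summarize(plan):
    """Segment count, pitch range, and mean energy of a feature plan.

    Hypothetical helper for illustration; not part of BatonVoice.
    """
    pitches = [seg["pitch_mean"] for seg in plan]
    return {
        "segments": len(plan),
        "pitch_range": max(pitches) - min(pitches),
        "mean_energy": mean(seg["energy_rms"] for seg in plan),
    }


for name, plan in (("happy", happy), ("sarcastic", sarcastic)):
    print(name, summarize(plan))
```

In these two plans, the sarcastic version uses more segments (four instead of two), a wider pitch range across segments (121 versus 30), and lower energy overall, which is consistent with the segmentation and loudness controls described earlier.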
| ## Features | |
| ### Core Functionality | |
| - **Unified TTS Interface**: Single interface supporting multiple TTS modes | |
| - **Emotion-Controlled Speech**: Generate speech with specific emotional characteristics | |
| - **Prosodic Feature Control**: Fine-tune pitch, energy, and spectral features | |
| - **Audio Feature Extraction**: Extract word-level features from audio files | |
| - **Web-based Interface**: User-friendly Gradio interface for easy interaction | |
| ### Four Main Modes | |
| 1. **Mode 1: Text + Features to Audio** | |
| - Input: Text and predefined prosodic features | |
| - Output: High-quality audio with controlled characteristics | |
| - Use case: Precise control over speech prosody | |
| 2. **Mode 2: Text to Features + Audio** | |
| - Input: Text only | |
| - Output: Generated features and corresponding audio | |
| - Use case: Automatic feature generation with natural speech | |
| 3. **Mode 3: Audio to Text Features** | |
| - Input: Audio file | |
| - Output: Extracted text and prosodic features | |
| - Use case: Analysis and feature extraction from existing audio | |
| 4. **Mode 4: Text + Instruction to Features** | |
| - Input: Text and emotional/stylistic instructions | |
| - Output: AI-generated prosodic features | |
| - Use case: Emotion-driven feature generation using AI | |
| ## Installation | |
| ### Prerequisites | |
| - Python 3.10 | |
| - CUDA-compatible GPU (recommended) | |
| - Git with submodule support | |
| ### Step-by-Step Installation | |
| 1. **Clone the repository with submodules**: | |
| ```bash | |
| git clone --recursive https://github.com/Tencent/digitalhuman.git | |
| cd digitalhuman/BatonVoice | |
| ``` | |
| 2. **Update submodules**: | |
| ```bash | |
| git submodule update --init --recursive | |
| ``` | |
| 3. **Create and activate Conda environment**: | |
| ```bash | |
| conda create -n batonvoice -y python=3.10 | |
| conda activate batonvoice | |
| ``` | |
| 4. **Install Python dependencies**: | |
| ```bash | |
| pip install -r requirements.txt | |
| ``` | |
| 5. **Download the CosyVoice2 model**: | |
| ```python | |
| from modelscope import snapshot_download | |
| snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B') | |
| ``` | |
| ## Quick Start | |
| ### Command Line Usage | |
| #### Basic Text-to-Speech | |
| ```bash | |
| python unified_tts.py --text "Hello world, how are you today?" --output output.wav | |
| ``` | |
| #### Text with Custom Features | |
| ```bash | |
| python unified_tts.py --text "Hello world" --features '[{"word": "Hello world", "pitch_mean": 280, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]' --output output.wav | |
| ``` | |
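Writing the `--features` JSON by hand on the command line is error-prone because of nested quoting. One way around this, sketched below, is to build the plan as a Python list and let `json.dumps` and `shlex` handle serialization and quoting. Only the flags shown in the command above (`--text`, `--features`, `--output`) are used; running the command still requires the repository and models to be set up as described under Installation.

```python
import json
import shlex

# Same feature plan as in the command above, as a Python structure.
features = [
    {"word": "Hello world", "pitch_mean": 280, "pitch_slope": 50,
     "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400},
]

argv = [
    "python", "unified_tts.py",
    "--text", "Hello world",
    "--features", json.dumps(features),
    "--output", "output.wav",
]

# Shell-safe command line that can be pasted into a POSIX shell.
print(shlex.join(argv))

# To run it from Python instead, pass the list directly (no shell quoting needed):
# import subprocess
# subprocess.run(argv, check=True)
```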
| #### Audio Feature Extraction | |
| ```bash | |
| python audio_feature_extractor.py --audio prompt.wav --output features.json | |
| ``` | |
| ### Web Interface | |
| Launch the Gradio web interface for interactive use: | |
| ```bash | |
| python gradio_tts_interface.py | |
| ``` | |
| Then open the provided URL in your browser to access the web interface. | |
| ## Project Structure | |
| ``` | |
| batonvoice/ | |
| ├── unified_tts.py # Main TTS engine with unified interface | |
| ├── gradio_tts_interface.py # Web-based user interface | |
| ├── audio_feature_extractor.py # Audio analysis and feature extraction | |
| ├── openrouter_gemini_client.py # AI-powered feature generation | |
| ├── requirements.txt # Python dependencies | |
| ├── prompt.wav # Default prompt audio file | |
| ├── third-party/ # External dependencies | |
| │   ├── CosyVoice/ # CosyVoice2 TTS model | |
| │   └── Matcha-TTS/ # Matcha-TTS model | |
| └── pretrained_models/ # Downloaded model files | |
|     └── CosyVoice2-0.5B/ # CosyVoice2 model directory | |
| ``` | |
| ## API Reference | |
| ### UnifiedTTS Class | |
| ```python | |
| from unified_tts import UnifiedTTS | |
| # Initialize TTS engine | |
| tts = UnifiedTTS( | |
| model_path='Yue-Wang/BatonTTS-1.7B', | |
| cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B', | |
| prompt_audio_path='./prompt.wav' | |
| ) | |
| # Mode 2 (text only): the model generates both the features and the audio | |
| tts.text_to_speech("Hello world", "output1.wav") | |
| # Mode 1 (text + features): synthesize audio from a predefined feature plan (JSON string) | |
| features = '[{"word": "Hello", "pitch_mean": 300, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]' | |
| tts.text_features_to_speech("Hello world", features, "output2.wav") | |
| ``` | |
| ### AudioFeatureExtractor Class | |
| ```python | |
| from audio_feature_extractor import AudioFeatureExtractor | |
| # Initialize extractor | |
| extractor = AudioFeatureExtractor() | |
| # Extract features from audio | |
| features = extractor.extract_features("input.wav") | |
| print(features) | |
| ``` | |
| ## Acknowledgments | |
| - [Qwen3](https://github.com/QwenLM/Qwen3): Powerful LLM Backbone | |
| - [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice): Advanced TTS model from FunAudioLLM | |
| - [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS): High-quality TTS architecture | |
| - [Whisper](https://github.com/openai/whisper): Speech recognition capabilities | |
| - [Wav2Vec2](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec): Word-level alignment features | |
| --- | |
| **Note**: For research purposes only. Do not use for commercial or production purposes. | |
| Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference | |