pinned: false
short_description: A Framework for Controllable Speech Synthesis
---
# <img src="./asserts/logo.png" alt="BatonVoice Logo" height="40" style="vertical-align: middle;"> BatonVoice: An Operationalist Framework for Controllable Speech Synthesis

[![arXiv](https://img.shields.io/badge/arXiv-Paper-b31b1b.svg)](https://arxiv.org/pdf/2509.26514)
[![Code](https://img.shields.io/badge/GitHub-Code-black.svg)](https://github.com/Tencent/digitalhuman/tree/main/BatonVoice)
[![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow)](https://huggingface.co/Yue-Wang/BatonTTS-1.7B)

This is the official implementation of the paper: **BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs**.

<img src="./asserts/infer.png" width="500px">

## 🎡 Abstract

We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation.

We introduce **BatonVoice**, a framework where a Large Language Model (LLM) acts as a **"conductor"**. The conductor's role is to understand nuanced user instructions and generate a detailed, textual **"plan"** consisting of explicit, word-level vocal features (e.g., pitch, energy, speaking rate).

A separate, specialized TTS model, the **"orchestra"**, then executes this plan, generating the final speech directly from these precise features. To realize this component, we developed **BatonTTS**, a 1.7B-parameter TTS model trained specifically for this task, built on a Qwen3-1.7B backbone and the speech tokenizer of CosyVoice2.

<img src="./asserts/framework.png" width="500px">

## πŸ›οΈ Framework Overview

The BatonVoice framework operates in a simple yet powerful sequence:

`[User Instruction] ➑️ [LLM (Conductor)] ➑️ [Textual Plan (Features)] ➑️ [BatonTTS (Orchestra)] ➑️ [Speech Output]`

This decoupling allows for unprecedented control and expressiveness, as the complex task of interpretation is handled by a powerful LLM, while the TTS model focuses solely on high-fidelity audio generation based on explicit guidance.
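
To make the data flow concrete, here is a minimal sketch of the two stages in Python. `generate_feature_plan` is a hypothetical stand-in for the conductor LLM call (for illustration it returns the hard-coded "happy tone" plan from the Examples section below); `UnifiedTTS` and `text_features_to_speech` are the interfaces documented under API Reference.

```python
# Minimal sketch of the conductor/orchestra split. `generate_feature_plan` is a
# hypothetical stand-in for an LLM call; here it returns the hard-coded "happy
# tone" plan from the Examples section of this README.
import json

from unified_tts import UnifiedTTS  # the "orchestra" (see API Reference)

def generate_feature_plan(text: str, instruction: str) -> list[dict]:
    # In the real framework, an LLM produces this plan from the instruction.
    return [
        {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95,
         "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
        {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80,
         "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
    ]

tts = UnifiedTTS(
    model_path='Yue-Wang/BatonTTS-1.7B',
    cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B',
    prompt_audio_path='./prompt.wav',
)

text = "Wow, you really did a great job."
plan = generate_feature_plan(text, "Please speak in a happy tone.")  # conductor
tts.text_features_to_speech(text, json.dumps(plan), "output.wav")    # orchestra
```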

## 🎧 Audio Examples

Here are some examples of speech generated by BatonVoice:

### Demo Video

<video src="https://github.com/user-attachments/assets/238735ce-3fa8-4b6d-b05b-dfba903b0187" type="video/mp4" width="80%" controls>
</video>

### More Samples

- **Example 1**: [Listen to example_1.wav](./asserts/example_1.wav)
- **Example 2**: [Listen to example_2.wav](./asserts/example_2.wav)
- **Example 3**: [Listen to example_3.wav](./asserts/example_3.wav)
- **Example 4**: [Listen to example_4.wav](./asserts/example_4.wav)

## πŸš€ Getting Started

### Core Principle: Word-Level Feature Control

The core of our framework is the ability to control the synthesized speech through word-level acoustic features. This means you can fine-tune the output by adjusting the specific numerical values for each word or segment.

### Recommended Workflow

For the best results, we highly recommend using a powerful, instruction-following LLM to generate the initial feature plan. This significantly reduces the manual effort required.

1. **Generate a Feature Template with an LLM**: Use a powerful LLM such as **Gemini 1.5 Pro** to generate a feature plan based on your text and a descriptive prompt (e.g., "in a happy and excited tone"); see the sketch after this list.
   * For detailed examples of how to structure these prompts, please refer to our client implementations: `openrouter_gemini_client.py` and `gradio_tts_interface.py`.
2. **(Optional) Manually Fine-Tune the Features**: Review the LLM-generated features. You can manually adjust the values for specific words or phrases to achieve the perfect delivery. This is where the true power of BatonVoice lies.
3. **Synthesize Speech with BatonTTS**: Feed the final feature plan into the BatonTTS model to generate the audio.
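
As a concrete illustration of step 1, the sketch below requests a plan through OpenRouter's OpenAI-compatible chat-completions endpoint. The prompt wording and model slug are illustrative assumptions; `openrouter_gemini_client.py` in this repository is the authoritative client.

```python
# Hedged sketch of step 1: ask the "conductor" LLM for a feature plan via
# OpenRouter's chat-completions API. Prompt text and model slug are
# illustrative; see openrouter_gemini_client.py for the repo's actual client.
import json
import os

import requests

def request_feature_plan(text: str, style: str) -> list[dict]:
    prompt = (
        f'For the sentence "{text}", return ONLY a JSON list of segments, each '
        'with "word", "pitch_mean", "pitch_slope", "energy_rms", '
        f'"energy_slope", and "spectral_centroid", spoken {style}.'
    )
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "google/gemini-2.5-pro",  # any strong instruction follower
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    # Assumes the model honors "ONLY JSON"; add stripping/validation as needed.
    return json.loads(resp.json()["choices"][0]["message"]["content"])

plan = request_feature_plan("Wow, you really did a great job.", "in a happy tone")
```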

### Alternative Method (Less Recommended)

You can also use BatonTTS in a text-only mode to generate both the features and the speech. However, due to the limitations of a smaller model, the generated features often lack variation, resulting in a monotonous voice. We strongly suggest using the LLM-driven workflow for expressive results.

## βš™οΈ Understanding the Features

You can control the speech output by adjusting the following features in the plan.

| Feature             | Description                                                  |
| ------------------- | ------------------------------------------------------------ |
| `pitch_mean`        | The mean fundamental frequency (F0) of the voice for the segment. Higher values mean a higher-pitched voice. |
| `pitch_slope`       | The rate of change of pitch within the segment. Positive values indicate a rising intonation. |
| `energy_rms`        | The root-mean-square energy, corresponding to the loudness or volume of the segment. |
| `energy_slope`      | The rate of change of energy. Can be used to create crescendo or decrescendo effects. |
| `spectral_centroid` | Relates to the "brightness" of the sound. Higher values often sound clearer or sharper. |
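
For orientation, here is one annotated segment from a plan. The numbers are illustrative and simply fall within the ranges used by the Examples section below; they are not prescribed limits.

```python
# One annotated plan segment. Values are illustrative; the commented ranges are
# those that appear in the Examples section of this README.
segment = {
    "word": "did a great job.",  # the words this segment covers
    "pitch_mean": 300,           # mean F0 in Hz (examples use ~166-360)
    "pitch_slope": -40,          # falling intonation across the segment
    "energy_rms": 0.010,         # loudness (examples use ~0.004-0.016)
    "energy_slope": -20,         # gentle decrescendo
    "spectral_centroid": 2200,   # "brightness" (examples use ~1403-2650)
}
```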

### A Special Feature: Word Segmentation

The `word` field and the structure of the feature list itself provide powerful control over the rhythm and pacing of the speech.

> **Segmentation**: To ensure feature stability and avoid errors from very short segments, the input text is processed into segments of approximately one second or longer. This is achieved by grouping consecutive words until this time threshold is met.
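
That grouping rule can be sketched as follows, assuming word-level timestamps (e.g., from a forced aligner) are available; the function and threshold names are illustrative, not the repository's code.

```python
# Sketch of the ~1 s grouping rule described above. Input is a list of
# (word, start_sec, end_sec) tuples from any forced aligner; output is the
# list of word strings per segment. Names here are illustrative.
MIN_SEGMENT_SEC = 1.0

def group_words(word_times: list[tuple[str, float, float]]) -> list[str]:
    segments, current, seg_start = [], [], None
    for word, start, end in word_times:
        if seg_start is None:
            seg_start = start
        current.append(word)
        if end - seg_start >= MIN_SEGMENT_SEC:  # segment long enough: close it
            segments.append(" ".join(current))
            current, seg_start = [], None
    if current:  # flush a trailing short segment
        segments.append(" ".join(current))
    return segments

print(group_words([("Wow,", 0.0, 0.5), ("you", 0.5, 0.7), ("really", 0.7, 1.1),
                   ("did", 1.1, 1.3), ("a", 1.3, 1.4), ("great", 1.4, 1.8),
                   ("job.", 1.8, 2.3)]))
# -> ['Wow, you really', 'did a great job.']
```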

This has two important implications:

1. **Speaking Rate**: The number of words in a segment's `word` field implicitly indicates the local speaking rate: more words in a single segment mean a faster rate of speech for that phrase.
2. **Pauses**: The boundaries between dictionaries in the list suggest potential pause locations in the synthesized speech. You can create a pause by splitting a sentence into more segments, as the comparison below shows.
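
A minimal illustration of both effects, with illustrative feature values: the one-segment plan reads the whole sentence quickly, while splitting at the comma slows delivery and suggests a pause there.

```python
# Same text, two pacings. Splitting at the comma suggests a pause and slows
# the local speaking rate; the feature values are illustrative.
fast_plan = [
    {"word": "Wow, you really did a great job.", "pitch_mean": 320,
     "pitch_slope": 30, "energy_rms": 0.012, "energy_slope": 0,
     "spectral_centroid": 2400},
]

paused_plan = [
    {"word": "Wow,", "pitch_mean": 350, "pitch_slope": 80,
     "energy_rms": 0.014, "energy_slope": 40, "spectral_centroid": 2600},
    {"word": "you really did a great job.", "pitch_mean": 310,
     "pitch_slope": -50, "energy_rms": 0.011, "energy_slope": -30,
     "spectral_centroid": 2300},
]
```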

## ✨ Examples

Let's see how to generate features for the sentence **"Wow, you really did a great job."** using Gemini 2.5 Pro with different emotional instructions.

### Example 1: Happy Tone

```python
# Prompt: "Please speak in a happy tone."
text = "Wow, you really did a great job."

feature_plan_happy = [
    {"word": "Wow, you really", "pitch_mean": 360, "pitch_slope": 95,
     "energy_rms": 0.016, "energy_slope": 60, "spectral_centroid": 2650},
    {"word": "did a great job.", "pitch_mean": 330, "pitch_slope": -80,
     "energy_rms": 0.014, "energy_slope": -50, "spectral_centroid": 2400},
]
```

🎡 **Audio Output**: [Listen to happy.wav](./asserts/happy.wav)

### Example 2: Sarcastic Tone

```python
# Prompt: "Please speak in a sarcastic tone."
text = "Wow, you really did a great job."

feature_plan_sarcastic = [
    {"word": "wow", "pitch_mean": 271, "pitch_slope": 6,
     "energy_rms": 0.009, "energy_slope": -4, "spectral_centroid": 2144},
    {"word": "you really", "pitch_mean": 270, "pitch_slope": 195,
     "energy_rms": 0.01, "energy_slope": 8, "spectral_centroid": 1403},
    {"word": "did a great", "pitch_mean": 287, "pitch_slope": 152,
     "energy_rms": 0.009, "energy_slope": -15, "spectral_centroid": 1920},
    {"word": "job", "pitch_mean": 166, "pitch_slope": -20,
     "energy_rms": 0.004, "energy_slope": -66, "spectral_centroid": 1881},
]
```

🎡 **Audio Output**: [Listen to sarcastic.wav](./asserts/sarcastic.wav)

## Features

### Core Functionality

- **Unified TTS Interface**: Single interface supporting multiple TTS modes
- **Emotion-Controlled Speech**: Generate speech with specific emotional characteristics
- **Prosodic Feature Control**: Fine-tune pitch, energy, and spectral features
- **Audio Feature Extraction**: Extract word-level features from audio files
- **Web-based Interface**: User-friendly Gradio interface for easy interaction

### Four Main Modes

1. **Mode 1: Text + Features to Audio**
   - Input: Text and predefined prosodic features
   - Output: High-quality audio with controlled characteristics
   - Use case: Precise control over speech prosody

2. **Mode 2: Text to Features + Audio**
   - Input: Text only
   - Output: Generated features and corresponding audio
   - Use case: Automatic feature generation with natural speech

3. **Mode 3: Audio to Text Features**
   - Input: Audio file
   - Output: Extracted text and prosodic features
   - Use case: Analysis and feature extraction from existing audio

4. **Mode 4: Text + Instruction to Features**
   - Input: Text and emotional/stylistic instructions
   - Output: AI-generated prosodic features
   - Use case: Emotion-driven feature generation using AI

## Installation

### Prerequisites

- Python 3.10
- CUDA-compatible GPU (recommended)
- Git with submodule support

### Step-by-Step Installation

1. **Clone the repository with submodules**:

   ```bash
   git clone --recursive https://github.com/Tencent/digitalhuman.git
   cd digitalhuman/BatonVoice
   ```

2. **Update submodules**:

   ```bash
   git submodule update --init --recursive
   ```

3. **Create and activate Conda environment**:

   ```bash
   conda create -n batonvoice -y python=3.10
   conda activate batonvoice
   ```

4. **Install Python dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

5. **Download the CosyVoice2 model**:

   ```python
   from modelscope import snapshot_download
   snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
   ```
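
Optionally, you can also pre-download the BatonTTS weights referenced in the API Reference below. This sketch assumes, as is conventional for `transformers`-style loaders, that `model_path` accepts either a Hugging Face Hub id or a local directory:

```python
# Optional: cache the BatonTTS weights locally so the first run works offline.
# Assumption: UnifiedTTS's model_path accepts a Hub id or a local directory.
from huggingface_hub import snapshot_download

snapshot_download('Yue-Wang/BatonTTS-1.7B', local_dir='pretrained_models/BatonTTS-1.7B')
```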

## Quick Start

### Command Line Usage

#### Basic Text-to-Speech

```bash
python unified_tts.py --text "Hello world, how are you today?" --output output.wav
```

#### Text with Custom Features

```bash
python unified_tts.py --text "Hello world" --features '[{"word": "Hello world", "pitch_mean": 280, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]' --output output.wav
```

#### Audio Feature Extraction

```bash
python audio_feature_extractor.py --audio prompt.wav --output features.json
```

### Web Interface

Launch the Gradio web interface for interactive use:

```bash
python gradio_tts_interface.py
```

Then open the provided URL in your browser to access the web interface.

## Project Structure

```
batonvoice/
β”œβ”€β”€ unified_tts.py              # Main TTS engine with unified interface
β”œβ”€β”€ gradio_tts_interface.py     # Web-based user interface
β”œβ”€β”€ audio_feature_extractor.py  # Audio analysis and feature extraction
β”œβ”€β”€ openrouter_gemini_client.py # AI-powered feature generation
β”œβ”€β”€ requirements.txt            # Python dependencies
β”œβ”€β”€ prompt.wav                  # Default prompt audio file
β”œβ”€β”€ third-party/                # External dependencies
β”‚   β”œβ”€β”€ CosyVoice/              # CosyVoice2 TTS model
β”‚   └── Matcha-TTS/             # Matcha-TTS model
└── pretrained_models/          # Downloaded model files
    └── CosyVoice2-0.5B/        # CosyVoice2 model directory
```

## API Reference

### UnifiedTTS Class

```python
from unified_tts import UnifiedTTS

# Initialize TTS engine
tts = UnifiedTTS(
    model_path='Yue-Wang/BatonTTS-1.7B',
    cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B',
    prompt_audio_path='./prompt.wav'
)

# Mode 1: Text + features to speech
features = '[{"word": "Hello", "pitch_mean": 300, "pitch_slope": 50, "energy_rms": 0.006, "energy_slope": 15, "spectral_centroid": 2400}]'
tts.text_features_to_speech("Hello world", features, "output1.wav")

# Mode 2: Text to speech (features are generated by BatonTTS itself)
tts.text_to_speech("Hello world", "output2.wav")
```

### AudioFeatureExtractor Class

```python
from audio_feature_extractor import AudioFeatureExtractor

# Initialize extractor
extractor = AudioFeatureExtractor()

# Extract features from audio
features = extractor.extract_features("input.wav")
print(features)
```
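
The two classes compose naturally. One plausible round trip, assuming `extract_features` returns the same segment-list structure that `text_features_to_speech` consumes (the exact return type is not documented here), is to lift the prosody of a reference recording onto newly synthesized audio:

```python
# Round-trip sketch: extract a reference recording's word-level features, then
# re-synthesize with them. Assumption: extract_features returns the same
# JSON-serializable segment list that text_features_to_speech accepts.
import json

from audio_feature_extractor import AudioFeatureExtractor
from unified_tts import UnifiedTTS

extractor = AudioFeatureExtractor()
tts = UnifiedTTS(
    model_path='Yue-Wang/BatonTTS-1.7B',
    cosyvoice_model_dir='./pretrained_models/CosyVoice2-0.5B',
    prompt_audio_path='./prompt.wav',
)

features = extractor.extract_features("reference.wav")   # Mode 3
text = " ".join(seg["word"] for seg in features)         # recover the transcript
tts.text_features_to_speech(text, json.dumps(features), "resynth.wav")  # Mode 1
```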

## Acknowledgments

- [Qwen3](https://github.com/QwenLM/Qwen3): Powerful LLM backbone
- [CosyVoice2](https://github.com/FunAudioLLM/CosyVoice): Advanced TTS model from FunAudioLLM
- [Matcha-TTS](https://github.com/shivammehta25/Matcha-TTS): High-quality TTS architecture
- [Whisper](https://github.com/openai/whisper): Speech recognition capabilities
- [Wav2Vec2](https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec): Word-level alignment features

---

**Note**: For research purposes only. Do not use for commercial or production purposes.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference