---
title: SingingSDS
emoji: 🎢
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
python_version: 3.11
---
# SingingSDS: Role-Playing Singing Spoken Dialogue System
SingingSDS is a role-playing singing dialogue system: it takes speech as input and responds with singing in the voice of a preset character.
## Installation
### Requirements
- Python 3.11+
- CUDA (optional, for GPU acceleration)
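The two requirements above can be checked from Python before installing anything else. The following is a minimal sketch, not part of the repository: the helper names `python_ok` and `cuda_available` are hypothetical, and the CUDA check assumes PyTorch is the library that will use the GPU.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version from the Requirements list


def python_ok(version=None):
    """Return True if the given (or running) interpreter is 3.11 or newer."""
    if version is None:
        version = sys.version_info
    return tuple(version[:2]) >= MIN_PYTHON


def cuda_available():
    """Return True only if PyTorch is installed and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed yet
    import torch

    return torch.cuda.is_available()
```

Calling `python_ok()` with no argument checks the running interpreter. `cuda_available()` returns `False` on a CPU-only machine, which is fine because CUDA is optional.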
### Install Dependencies
#### Option 1: Using Conda (Recommended)
```bash
conda create -n singingsds python=3.11
conda activate singingsds
# Installs the CUDA 11.8 build of PyTorch; change pytorch-cuda to match your driver
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
#### Option 2: Using pip only
```bash
pip install -r requirements.txt
```
#### Option 3: Using pip with a virtual environment
```bash
python -m venv singingsds_env
# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
```
## Usage
### Command Line Interface (CLI)
#### Example Usage
```bash
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default.yaml \
  --output_audio outputs/yaoyin_hello.wav
```
#### Parameter Description
- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)
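For readers who want to see how these three flags map onto an argument parser, here is a minimal `argparse` sketch. It mirrors only the flags, required status, and default documented above. It is an illustration, not the actual contents of `cli.py`, and `build_parser` is a hypothetical name.

```python
import argparse

# Default taken from the parameter description above.
DEFAULT_CONFIG = "config/cli/yaoyin_default.yaml"


def build_parser():
    """Build a parser exposing the three documented CLI flags."""
    parser = argparse.ArgumentParser(description="SingingSDS CLI (sketch)")
    parser.add_argument("--query_audio", required=True,
                        help="Input audio file path")
    parser.add_argument("--config_path", default=DEFAULT_CONFIG,
                        help="Configuration file path")
    parser.add_argument("--output_audio", required=True,
                        help="Output audio file path")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--query_audio", "tests/audio/hello.wav",
     "--output_audio", "outputs/yaoyin_hello.wav"]
)
```

Because `--config_path` is omitted in the example call, `args.config_path` falls back to the Yaoyin default configuration.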
### Web Interface (Gradio)
Start the web interface:
```bash
python app.py
```
Then open the address printed in the terminal in your browser to use the graphical interface.
## Configuration
### Character Configuration
The system supports multiple preset characters:
- **Yaoyin (遥音)**: Default timbre is `timbre2`
- **Limei (丽梅)**: Default timbre is `timbre1`
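The character-to-timbre defaults listed above could be represented as a small lookup table. This is only a sketch under assumed names: `CharacterPreset`, `PRESETS`, and `default_timbre` are hypothetical and do not reflect how `characters/` is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterPreset:
    """Hypothetical record pairing a character with its default timbre."""
    name: str
    default_timbre: str


# Values taken from the list of preset characters above.
PRESETS = {
    "yaoyin": CharacterPreset("Yaoyin", "timbre2"),
    "limei": CharacterPreset("Limei", "timbre1"),
}


def default_timbre(character):
    """Look up the default timbre, ignoring case and surrounding spaces."""
    key = character.strip().lower()
    if key not in PRESETS:
        raise KeyError(f"unknown character: {character!r}")
    return PRESETS[key].default_timbre
```

For example, `default_timbre("Yaoyin")` yields `timbre2`, and an unknown name raises `KeyError` rather than silently picking a voice.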
### Model Configuration
#### ASR Models
- `openai/whisper-large-v3-turbo`
- `openai/whisper-large-v3`
- `openai/whisper-medium`
- `sanchit-gandhi/whisper-small-dv`
- `facebook/wav2vec2-base-960h`
#### LLM Models
- `google/gemma-2-2b`
- `MiniMaxAI/MiniMax-M1-80k`
- `meta-llama/Llama-3.2-3B-Instruct`
#### SVS Models
- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (Bilingual)
- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
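When editing a configuration by hand, it can help to check that each chosen model ID is one of those listed above. The sketch below does that. The stage keys `asr`, `llm`, and `svs` and the function `validate_selection` are assumptions for illustration, since the real YAML schema is not shown in this README.

```python
# Model IDs copied from the lists above, grouped by (assumed) stage key.
SUPPORTED_MODELS = {
    "asr": {
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-medium",
        "sanchit-gandhi/whisper-small-dv",
        "facebook/wav2vec2-base-960h",
    },
    "llm": {
        "google/gemma-2-2b",
        "MiniMaxAI/MiniMax-M1-80k",
        "meta-llama/Llama-3.2-3B-Instruct",
    },
    "svs": {
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
        "espnet/aceopencpop_svs_visinger2_40singer_pretrain",
    },
}


def validate_selection(selection):
    """Return a list of problems; an empty list means the selection is valid."""
    errors = []
    for stage, allowed in SUPPORTED_MODELS.items():
        model_id = selection.get(stage)
        if model_id is None:
            errors.append(f"missing model for stage '{stage}'")
        elif model_id not in allowed:
            errors.append(f"unsupported {stage} model: {model_id}")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a caller report every misconfigured stage at once.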
## Project Structure
```
SingingSDS/
├── cli.py                      # Command line interface
├── interface.py                # Gradio interface
├── pipeline.py                 # Core processing pipeline
├── app.py                      # Web application entry
├── requirements.txt            # Python dependencies
├── config/                     # Configuration files
│   ├── cli/                    # CLI-specific configuration
│   └── interface/              # Interface-specific configuration
├── modules/                    # Core modules
│   ├── asr.py                  # Speech recognition module
│   ├── llm.py                  # Large language model module
│   ├── melody.py               # Melody control module
│   ├── svs/                    # Singing voice synthesis modules
│   │   ├── base.py             # Base SVS class
│   │   ├── espnet.py           # ESPnet SVS implementation
│   │   ├── registry.py         # SVS model registry
│   │   └── __init__.py         # SVS module initialization
│   └── utils/                  # Utility modules
│       ├── g2p.py              # Grapheme-to-phoneme conversion
│       ├── text_normalize.py   # Text normalization
│       └── resources/          # Utility resources
├── characters/                 # Character definitions
│   ├── base.py                 # Base character class
│   ├── Limei.py                # Limei character definition
│   ├── Yaoyin.py               # Yaoyin character definition
│   └── __init__.py             # Character module initialization
├── evaluation/                 # Evaluation modules
│   └── svs_eval.py             # SVS evaluation metrics
├── data/                       # Data directory
│   ├── kising/                 # Kising dataset
│   └── touhou/                 # Touhou dataset
├── resources/                  # Project resources
├── data_handlers/              # Data handling utilities
├── assets/                     # Static assets
└── tests/                      # Test files
```
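The layout lists a base SVS class, an ESPnet implementation, and a model registry under `modules/svs/`. One common way such pieces fit together is a decorator-based registry, sketched below. Every name here (`register_svs`, `BaseSVS`, `ESPnetSVS`, `get_svs`) and the prefix-based lookup are assumptions for illustration. The repository's actual `registry.py` may work differently.

```python
from abc import ABC, abstractmethod

# Maps a model-ID prefix (the part before "/") to an implementation class.
_SVS_REGISTRY = {}


def register_svs(prefix):
    """Class decorator that records an SVS implementation under a prefix."""
    def decorator(cls):
        _SVS_REGISTRY[prefix] = cls
        return cls
    return decorator


class BaseSVS(ABC):
    """Hypothetical common interface for singing voice synthesis backends."""

    def __init__(self, model_id):
        self.model_id = model_id

    @abstractmethod
    def synthesize(self, score):
        """Turn a musical score into audio (backend-specific)."""


@register_svs("espnet")
class ESPnetSVS(BaseSVS):
    def synthesize(self, score):
        # Placeholder only: real synthesis would call into ESPnet here.
        raise NotImplementedError("synthesis is outside the scope of this sketch")


def get_svs(model_id):
    """Instantiate the backend registered for the model ID's prefix."""
    prefix = model_id.split("/", 1)[0]
    if prefix not in _SVS_REGISTRY:
        raise ValueError(f"no SVS backend registered for '{prefix}'")
    return _SVS_REGISTRY[prefix](model_id)
```

With this pattern, adding a new backend means writing one decorated class, and the lookup code in `get_svs` stays unchanged.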
## Contributing
Issues and Pull Requests are welcome!
## License