metadata
title: SingingSDS
emoji: 🎶
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
python_version: 3.11
SingingSDS: Role-Playing Singing Spoken Dialogue System
A role-playing singing dialogue system that converts speech input into character-based singing output.
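The module names listed under Project Structure (`asr.py`, `llm.py`, `melody.py`, `svs/`) suggest a four-stage flow: recognize the spoken query, generate an in-character reply, attach a melody, and synthesize singing. The sketch below assumes that order and treats each stage as a plain callable. `SingingPipeline`, the `Score` representation, and the stage signatures are hypothetical and are not the actual `pipeline.py` API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical score format: (lyric unit, MIDI pitch) pairs.
Score = List[Tuple[str, int]]


@dataclass
class SingingPipeline:
    """Hypothetical speech-to-singing pipeline; stage signatures are assumptions."""
    asr: Callable[[bytes], str]          # query audio -> transcript
    llm: Callable[[str, str], str]       # (character, transcript) -> reply text
    melody: Callable[[str], Score]       # reply text -> pitched lyric units
    svs: Callable[[Score], bytes]        # score -> singing audio

    def run(self, query_audio: bytes, character: str) -> bytes:
        transcript = self.asr(query_audio)       # speech recognition
        reply = self.llm(character, transcript)  # in-character text reply
        score = self.melody(reply)               # assign a pitch to each unit
        return self.svs(score)                   # synthesize the sung reply


# Toy stubs so the flow can be exercised without loading any models.
pipe = SingingPipeline(
    asr=lambda audio: "hello",
    llm=lambda character, text: f"{character}: {text}!",
    melody=lambda text: [(ch, 60 + i) for i, ch in enumerate(text)],
    svs=lambda score: bytes(pitch for _, pitch in score),
)
out = pipe.run(b"\x00\x01", "Yaoyin")
```

Keeping each stage a swappable callable mirrors the way the project lets you choose among several ASR, LLM, and SVS models.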
Installation
Requirements
- Python 3.11+
- CUDA (optional, for GPU acceleration)
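A quick way to confirm these requirements before installing is sketched below. It assumes GPU support is detected through PyTorch (which the Conda option installs); the helper names `python_ok` and `cuda_available` are illustrative, not part of the project.

```python
import importlib.util
import sys


def python_ok(minimum: tuple = (3, 11)) -> bool:
    """Return True if the running interpreter meets the minimum (major, minor)."""
    return sys.version_info[:2] >= minimum


def cuda_available() -> bool:
    """Return True only if PyTorch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # CUDA is optional; without torch we report no GPU
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]} meets 3.11+: {python_ok()}")
    print(f"CUDA available: {cuda_available()}")
```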
Install Dependencies
Option 1: Using Conda (Recommended)
conda create -n singingsds python=3.11
conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
Option 2: Using pip only
pip install -r requirements.txt
Option 3: Using pip with virtual environment
python -m venv singingsds_env
# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
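Running `pip install` outside the intended environment is a common mistake. The hypothetical helper below reports which environment the interpreter is running in. It relies on two standard signals: Conda sets `CONDA_DEFAULT_ENV` when an environment is activated, and inside a `venv` `sys.prefix` differs from `sys.base_prefix`.

```python
import os
import sys


def active_environment() -> str:
    """Describe the isolated environment (if any) the interpreter runs in."""
    conda_env = os.environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda:{conda_env}"
    # In a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base interpreter.
    if sys.prefix != sys.base_prefix:
        return f"venv:{sys.prefix}"
    return "system"


if __name__ == "__main__":
    print(active_environment())  # e.g. "conda:singingsds" after Option 1
```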
Usage
Command Line Interface (CLI)
Example Usage
python cli.py --query_audio tests/audio/hello.wav --config_path config/cli/yaoyin_default.yaml --output_audio outputs/yaoyin_hello.wav
Parameter Description
- --query_audio: Input audio file path (required)
- --config_path: Configuration file path (default: config/cli/yaoyin_default.yaml)
- --output_audio: Output audio file path (required)
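The three documented flags map naturally onto `argparse`. The sketch below is an assumption about how such a parser could look, not a copy of `cli.py`, which may define additional options. It parses the example invocation shown above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented CLI flags only."""
    parser = argparse.ArgumentParser(description="SingingSDS CLI (sketch)")
    parser.add_argument("--query_audio", required=True,
                        help="Input audio file path")
    parser.add_argument("--config_path",
                        default="config/cli/yaoyin_default.yaml",
                        help="Configuration file path")
    parser.add_argument("--output_audio", required=True,
                        help="Output audio file path")
    return parser


# Parse the example command from above (without --config_path, so the default applies).
args = build_parser().parse_args([
    "--query_audio", "tests/audio/hello.wav",
    "--output_audio", "outputs/yaoyin_hello.wav",
])
```

Omitting `--query_audio` or `--output_audio` makes `argparse` print a usage error and exit, which matches their "required" status.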
Web Interface (Gradio)
Start the web interface:
python app.py
Then open the local URL printed in the terminal in your browser to use the graphical interface.
Configuration
Character Configuration
The system supports multiple preset characters:
- Yaoyin (遥音): default timbre is timbre2
- Limei (丽梅): default timbre is timbre1
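A small sketch of how the preset defaults could be looked up, with an optional per-request timbre override, follows. The `CharacterPreset` class, `PRESETS` table, and `resolve_timbre` function are hypothetical; the real definitions live in `characters/Yaoyin.py` and `characters/Limei.py`. Only the names and default timbres come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharacterPreset:
    """Hypothetical stand-in for a character definition."""
    name: str
    chinese_name: str
    default_timbre: str


# Defaults taken from the character list above.
PRESETS = {
    "yaoyin": CharacterPreset("Yaoyin", "遥音", "timbre2"),
    "limei": CharacterPreset("Limei", "丽梅", "timbre1"),
}


def resolve_timbre(character: str, override: Optional[str] = None) -> str:
    """Return an explicit timbre override, else the character's default timbre."""
    preset = PRESETS[character.lower()]  # raises KeyError for unknown characters
    return override if override is not None else preset.default_timbre
```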
Model Configuration
ASR Models
- openai/whisper-large-v3-turbo
- openai/whisper-large-v3
- openai/whisper-medium
- sanchit-gandhi/whisper-small-dv
- facebook/wav2vec2-base-960h
LLM Models
- google/gemma-2-2b
- MiniMaxAI/MiniMax-M1-80k
- meta-llama/Llama-3.2-3B-Instruct
SVS Models
- espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg (Bilingual)
- espnet/aceopencpop_svs_visinger2_40singer_pretrain (Chinese)
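The sketch below shows one way to reject unsupported model IDs early, before any weights are downloaded. The model IDs and language notes come from the lists above. `ModelConfig` and its field names are hypothetical and do not correspond to the actual keys in the YAML files under `config/`.

```python
from dataclasses import dataclass

# Supported model IDs, copied from the lists above.
ASR_MODELS = frozenset({
    "openai/whisper-large-v3-turbo",
    "openai/whisper-large-v3",
    "openai/whisper-medium",
    "sanchit-gandhi/whisper-small-dv",
    "facebook/wav2vec2-base-960h",
})
LLM_MODELS = frozenset({
    "google/gemma-2-2b",
    "MiniMaxAI/MiniMax-M1-80k",
    "meta-llama/Llama-3.2-3B-Instruct",
})
# SVS model ID -> language coverage as noted in the list above.
SVS_MODELS = {
    "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg": "bilingual",
    "espnet/aceopencpop_svs_visinger2_40singer_pretrain": "chinese",
}


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical model selection; field names are illustrative only."""
    asr_model: str
    llm_model: str
    svs_model: str

    def __post_init__(self) -> None:
        # Fail fast on any model ID outside the supported lists.
        checks = (
            ("asr_model", self.asr_model, ASR_MODELS),
            ("llm_model", self.llm_model, LLM_MODELS),
            ("svs_model", self.svs_model, SVS_MODELS),
        )
        for field_name, value, allowed in checks:
            if value not in allowed:
                raise ValueError(f"unsupported {field_name}: {value!r}")

    @property
    def svs_language(self) -> str:
        """Language coverage of the selected SVS model."""
        return SVS_MODELS[self.svs_model]
```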
Project Structure
SingingSDS/
├── cli.py                     # Command line interface
├── interface.py               # Gradio interface
├── pipeline.py                # Core processing pipeline
├── app.py                     # Web application entry
├── requirements.txt           # Python dependencies
├── config/                    # Configuration files
│   ├── cli/                   # CLI-specific configuration
│   └── interface/             # Interface-specific configuration
├── modules/                   # Core modules
│   ├── asr.py                 # Speech recognition module
│   ├── llm.py                 # Large language model module
│   ├── melody.py              # Melody control module
│   ├── svs/                   # Singing voice synthesis modules
│   │   ├── base.py            # Base SVS class
│   │   ├── espnet.py          # ESPnet SVS implementation
│   │   ├── registry.py        # SVS model registry
│   │   └── __init__.py        # SVS module initialization
│   └── utils/                 # Utility modules
│       ├── g2p.py             # Grapheme-to-phoneme conversion
│       ├── text_normalize.py  # Text normalization
│       └── resources/         # Utility resources
├── characters/                # Character definitions
│   ├── base.py                # Base character class
│   ├── Limei.py               # Limei character definition
│   ├── Yaoyin.py              # Yaoyin character definition
│   └── __init__.py            # Character module initialization
├── evaluation/                # Evaluation modules
│   └── svs_eval.py            # SVS evaluation metrics
├── data/                      # Data directory
│   ├── kising/                # Kising dataset
│   └── touhou/                # Touhou dataset
├── resources/                 # Project resources
├── data_handlers/             # Data handling utilities
├── assets/                    # Static assets
└── tests/                     # Test files
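`modules/svs/` separates a base class (`base.py`), a concrete backend (`espnet.py`), and a registry (`registry.py`). A common way to wire those together is a decorator-based registry keyed by model-ID prefix. The sketch below assumes that design. `AbstractSVSModel`, `register_svs_model`, `get_svs_model`, and `ESPnetSVS` are hypothetical names, not the project's actual API.

```python
from typing import Callable, Dict, Type


class AbstractSVSModel:
    """Hypothetical stand-in for the base class in modules/svs/base.py."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id


# Maps a model-ID prefix (e.g. "espnet/") to the class that handles it.
_REGISTRY: Dict[str, Type[AbstractSVSModel]] = {}


def register_svs_model(prefix: str) -> Callable[[Type[AbstractSVSModel]], Type[AbstractSVSModel]]:
    """Class decorator that registers an SVS backend for a model-ID prefix."""
    def decorator(cls: Type[AbstractSVSModel]) -> Type[AbstractSVSModel]:
        if prefix in _REGISTRY:
            raise ValueError(f"duplicate registration for {prefix!r}")
        _REGISTRY[prefix] = cls
        return cls
    return decorator


def get_svs_model(model_id: str) -> AbstractSVSModel:
    """Instantiate the first registered backend whose prefix matches model_id."""
    for prefix, cls in _REGISTRY.items():
        if model_id.startswith(prefix):
            return cls(model_id)
    raise ValueError(f"no SVS backend registered for {model_id!r}")


@register_svs_model("espnet/")
class ESPnetSVS(AbstractSVSModel):
    """Placeholder for an ESPnet-backed implementation."""


model = get_svs_model("espnet/aceopencpop_svs_visinger2_40singer_pretrain")
```

With this design, adding a new synthesis backend means writing one decorated subclass. Callers such as the pipeline never need to import concrete classes directly.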
Contributing
Issues and Pull Requests are welcome!