# XY Tokenizer
XY Tokenizer is a speech codec that models both the semantic and acoustic aspects of speech, converting audio into discrete tokens and decoding them back to high-quality audio. It represents speech at only 1 kbps, using RVQ8 residual vector quantization at a 12.5 Hz frame rate.
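
The 1 kbps figure can be checked as frame rate × number of codebooks × bits per code. The sketch below assumes that RVQ8 means 8 codebooks and that each codebook has 1024 entries (10 bits per code). The codebook size is not stated in this README, and rounding partial frames up is likewise an assumption. The function names are illustrative.

```python
import math

def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for a codec that emits one code per codebook per frame."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code

def tokens_for(duration_s: float, frame_rate_hz: float, num_codebooks: int) -> int:
    """Number of discrete tokens for a clip, counting a partial last frame as a full one."""
    frames = math.ceil(duration_s * frame_rate_hz)
    return frames * num_codebooks

# Assumed codebook size of 1024 entries; 12.5 Hz and 8 codebooks come from the text.
print(bitrate_bps(12.5, 8, 1024))  # 1000.0
print(tokens_for(30.0, 12.5, 8))   # 3000
```

Under these assumptions a 30-second clip becomes 375 frames of 8 codes each.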
## Features
- **Dual-channel modeling**: Simultaneously captures semantic meaning and acoustic details
- **Efficient representation**: 1 kbps bitrate with RVQ8 quantization at 12.5 Hz
- **High-quality audio tokenization**: Convert speech to discrete tokens and back with minimal quality loss
- **Long audio support**: Process audio files longer than 30 seconds using chunking with overlap
- **Batch processing**: Efficiently process multiple audio files in batches
- **24 kHz output**: Generate high-quality 24 kHz audio output
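
Long audio support is described as chunking with overlap. This README does not show how `inference.py` implements it, so the following is only a sketch of one common way to plan overlapping chunks. The function name `chunk_spans`, the 2-second overlap, and the 16 kHz sample rate are illustrative assumptions. The 30-second chunk length mirrors the threshold mentioned above.

```python
def chunk_spans(num_samples: int, chunk_len: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges that cover the signal with overlapping chunks."""
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")
    hop = chunk_len - overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        spans.append((start, end))
        if end == num_samples:  # the last chunk reached the end of the signal
            break
        start += hop
    return spans

sr = 16_000  # illustrative sample rate, not taken from the project
spans = chunk_spans(70 * sr, 30 * sr, 2 * sr)
print([(s / sr, e / sr) for s, e in spans])
# [(0.0, 30.0), (28.0, 58.0), (56.0, 70.0)]
```

How the overlapping regions are merged after decoding (for example by cross-fading or by discarding part of each overlap) is a separate design choice that this sketch does not cover.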
## Installation
```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
# Install dependencies
pip install -r requirements.txt
```
## Usage
### Basic Inference
To tokenize audio files and reconstruct them:
```bash
python inference.py \
    --config_path ./config/xy_tokenizer_config.yaml \
    --checkpoint_path ./weights/xy_tokenizer.ckpt \
    --input_dir ./input_wavs/ \
    --output_dir ./output_wavs/
```
### Parameters
- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
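
For scripted runs, the documented flags can be assembled into an argument list from Python. This is a minimal sketch: the helper name `build_inference_command` is hypothetical, and it uses only the flags listed above, with the documented default of `"cuda"` for `--device`.

```python
import shlex
import sys

def build_inference_command(config_path, checkpoint_path, input_dir, output_dir, device="cuda"):
    """Assemble the documented inference.py invocation as an argument list."""
    return [
        sys.executable, "inference.py",
        "--config_path", str(config_path),
        "--checkpoint_path", str(checkpoint_path),
        "--input_dir", str(input_dir),
        "--output_dir", str(output_dir),
        "--device", device,
    ]

cmd = build_inference_command(
    "./config/xy_tokenizer_config.yaml",
    "./weights/xy_tokenizer.ckpt",
    "./input_wavs/",
    "./output_wavs/",
)
# Print the shell-quoted command for inspection before running it.
print(shlex.join(cmd))
```

To execute the command, pass `cmd` to `subprocess.run(cmd, check=True)` from the repository root.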
## Project Structure
- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files
## Model Architecture
XY Tokenizer uses a dual-channel architecture that simultaneously models:
1. **Semantic Channel**: Captures high-level semantic information and linguistic content
2. **Acoustic Channel**: Preserves detailed acoustic features including speaker characteristics and prosody
The model processes audio through several stages:
1. Feature extraction (mel-spectrogram)
2. Parallel semantic and acoustic encoding
3. Residual Vector Quantization (RVQ8) at 12.5 Hz frame rate (1 kbps)
4. Decoding and waveform generation
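
The quantization stage can be illustrated with a generic residual vector quantization sketch: each stage picks the nearest codebook entry to whatever the previous stages left unexplained, and decoding sums the chosen entries. This shows the general technique, not the project's implementation. The random codebooks stand in for learned ones, and the vector dimension, the codebook size of 16, and all function names are illustrative assumptions. Only the 8 stages echo the RVQ8 setting.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous stage's residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

rng = random.Random(0)
dim, num_stages, codebook_size = 4, 8, 16  # illustrative sizes
# Random codebooks with a shrinking scale per stage, as placeholders for learned ones.
codebooks = [
    [[rng.gauss(0, 0.5 ** stage) for _ in range(dim)] for _ in range(codebook_size)]
    for stage in range(num_stages)
]
frame = [rng.gauss(0, 1) for _ in range(dim)]  # one encoder output frame (made up)
codes = rvq_encode(frame, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes)   # one index per stage, 8 in total
print(approx)  # quantized approximation of `frame`
```

Each frame is therefore stored as 8 small integers, which is what keeps the bitrate low at a 12.5 Hz frame rate.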
## License
[Specify your license here] |