Spaces:

fnlp
/

MOSS-TTSD

Running on Zero

App Files Files Community

MOSS-TTSD / XY_Tokenizer /README.md

yhzx233

feat: app.py

ea174b0 about 2 months ago

preview code

raw

history blame contribute delete

2.63 kB

	# XY Tokenizer

	XY Tokenizer is a speech codec that simultaneously models both semantic and acoustic aspects of speech, converting audio into discrete tokens and decoding them back to high-quality audio. It achieves efficient speech representation at only 1kbps with RVQ8 quantization at 12.5Hz frame rate.

	## Features

	- Dual-channel modeling: Simultaneously captures semantic meaning and acoustic details
	- Efficient representation: 1kbps bitrate with RVQ8 quantization at 12.5Hz
	- High-quality audio tokenization: Convert speech to discrete tokens and back with minimal quality loss
	- Long audio support: Process audio files longer than 30 seconds using chunking with overlap
	- Batch processing: Efficiently process multiple audio files in batches
	- 24kHz output: Generate high-quality 24kHz audio output

	## Installation

	```bash
	# Create and activate conda environment
	conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

	# Install dependencies
	pip install -r requirements.txt
	```

	## Usage

	### Basic Inference

	To tokenize audio files and reconstruct them:

	```bash
	python inference.py \
	--config_path ./config/xy_tokenizer_config.yaml \
	--checkpoint_path ./weights/xy_tokenizer.ckpt \
	--input_dir ./input_wavs/ \
	--output_dir ./output_wavs/
	```

	### Parameters

	- `--config_path`: Path to the model configuration file
	- `--checkpoint_path`: Path to the pre-trained model checkpoint
	- `--input_dir`: Directory containing input WAV files
	- `--output_dir`: Directory to save reconstructed audio files
	- `--device`: Device to run inference on (default: "cuda")
	- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)

	## Project Structure

	- `xy_tokenizer/`: Core model implementation
	- `model.py`: Main XY_Tokenizer model class
	- `nn/`: Neural network components
	- `config/`: Configuration files
	- `utils/`: Utility functions
	- `weights/`: Pre-trained model weights
	- `input_wavs/`: Directory for input audio files
	- `output_wavs/`: Directory for output audio files

	## Model Architecture

	XY Tokenizer uses a dual-channel architecture that simultaneously models:
	1. Semantic Channel: Captures high-level semantic information and linguistic content
	2. Acoustic Channel: Preserves detailed acoustic features including speaker characteristics and prosody

	The model processes audio through several stages:
	1. Feature extraction (mel-spectrogram)
	2. Parallel semantic and acoustic encoding
	3. Residual Vector Quantization (RVQ8) at 12.5Hz frame rate (1kbps)
	4. Decoding and waveform generation

	## License

	[Specify your license here]