# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.
## Running the Application

```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```

The app requires a HuggingFace token set as the `HF_TOKEN` environment variable to download models.
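A missing token usually surfaces late, as a download error. A minimal sketch of an early check is shown here; `require_hf_token` is a hypothetical helper and is not part of this repository.

```python
import os


def require_hf_token(env=None):
    """Return the HF_TOKEN value, or raise a clear error if it is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running app.py")
    return token
```

Calling it at the top of a script fails fast with a readable message instead of an authentication error during model loading.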
## Architecture

### Token Flow Pipeline

The system uses a custom token layout that interleaves text and audio in a single sequence:

1. **Input prompt construction** (`KaniModel.get_input_ids`):
   - `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
   - Optionally prefixed with speaker ID (e.g., "andrew: Hello world")
2. **LLM generation** (`KaniModel.model_request`):
   - Model generates sequence containing: text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`
3. **Audio decoding** (`NemoAudioPlayer.get_waveform`):
   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Tokens are offset by `audio_tokens_start + (codebook_size * codebook_index)`
   - NeMo codec reconstructs waveform from the 4 codebooks
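The audio decoding step can be sketched as follows. This is an illustration, not the code in `util.py`: `extract_codes` is a hypothetical name, and the sketch assumes the interleaving is frame-major (each consecutive group of 4 tokens holds one value per codebook) and that a trailing partial frame is dropped. The constants follow the values listed under Important Token Constants.

```python
import numpy as np

TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4


def extract_codes(output_ids):
    """Return codec indices with shape (NUM_CODEBOOKS, n_frames)."""
    ids = np.asarray(output_ids, dtype=np.int64)
    starts = np.flatnonzero(ids == START_OF_SPEECH)
    if starts.size == 0:
        raise ValueError("output is missing START_OF_SPEECH")
    start = starts[0]
    ends = np.flatnonzero(ids == END_OF_SPEECH)
    ends = ends[ends > start]
    if ends.size == 0:
        raise ValueError("output is missing END_OF_SPEECH")

    audio = ids[start + 1:ends[0]]
    # Assumption: drop a trailing partial frame rather than fail.
    usable = (audio.size // NUM_CODEBOOKS) * NUM_CODEBOOKS
    frames = audio[:usable].reshape(-1, NUM_CODEBOOKS)

    # Undo the per-codebook offset: audio_tokens_start + codebook_size * index.
    offsets = AUDIO_TOKENS_START + CODEBOOK_SIZE * np.arange(NUM_CODEBOOKS)
    codes = frames - offsets
    if codes.size and (codes.min() < 0 or codes.max() >= CODEBOOK_SIZE):
        raise ValueError("audio token outside its codebook range")
    return codes.T
```

The transposed result puts one codebook per row, which is the general shape a multi-codebook codec decoder expects before a batch dimension is added.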
### Key Classes

**`NemoAudioPlayer`** (util.py:27-170)
- Loads NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from `tokeniser_length` base)
- Validates output has required speech markers
- Extracts and decodes 4-codebook audio tokens from LLM output
- Returns 22050 Hz audio as NumPy array

**`KaniModel`** (util.py:172-303)
- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens
- Runs generation with sampling parameters (temp, top_p, repetition_penalty)
- Delegates audio reconstruction to `NemoAudioPlayer`
- Returns tuple: (audio_array, text, timing_report)

**`InitModels`** (util.py:305-343)
- Factory that loads all models from `model_config.yaml` at startup
- Returns dict mapping model names to `KaniModel` instances
- All models share the same `NemoAudioPlayer` instance

**`Examples`** (util.py:345-387)
- Converts `examples.yaml` structure into Gradio Examples format
- Output order: `[text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]`
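The `Examples` output order can be illustrated with a small sketch. It assumes each entry is a dict whose keys match the field names in the output order above; the actual key names in `examples.yaml` may differ, and `to_gradio_row` is a hypothetical helper.

```python
# Field order expected by the Gradio Examples rows, as listed above.
FIELD_ORDER = [
    "text",
    "model",
    "speaker_id",
    "temperature",
    "top_p",
    "repetition_penalty",
    "max_len",
]


def to_gradio_row(example):
    """Flatten one example dict into a positional row; missing keys become None."""
    return [example.get(name) for name in FIELD_ORDER]


def to_gradio_rows(examples):
    """Convert a list of example dicts into a list of rows."""
    return [to_gradio_row(example) for example in examples]
```

Keeping the order in one list means the rows stay aligned with the UI inputs if a field is added later.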
### Configuration Files

**`model_config.yaml`**
- `nemo_player`: NeMo codec config (model name, token layout constants)
- `models`: Dict of available TTS models with `device_map` and optional `speaker_id` mappings

**`examples.yaml`**
- List of example prompts with associated parameters for Gradio UI
### Dependency Setup

`create_env.py` runs before imports in `app.py` to:
- Install transformers from the git main branch (required for compatibility)
- Set `OMP_NUM_THREADS=4`
- Skip reinstalling on later runs by checking the `/tmp/deps_installed` marker file
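The marker-file pattern described above can be sketched like this. It is not the contents of `create_env.py`: `ensure_deps` and `default_install` are hypothetical names, and the exact pip command is an assumption.

```python
import os
import subprocess
import sys

MARKER = "/tmp/deps_installed"


def default_install():
    # Assumption: install transformers from the main branch of its git repository.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "git+https://github.com/huggingface/transformers.git",
    ])


def ensure_deps(marker=MARKER, install=default_install):
    """Set OMP_NUM_THREADS and install once; return True if an install ran."""
    os.environ["OMP_NUM_THREADS"] = "4"
    if os.path.exists(marker):
        return False  # marker present: dependencies were installed earlier
    install()
    # Write the marker only after a successful install.
    with open(marker, "w") as fh:
        fh.write("ok\n")
    return True
```

Because `/tmp` is cleared when the container restarts, the install runs again after a restart but not on every reload within the same container.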
## Important Token Constants

All special tokens are defined relative to `tokeniser_length` (64400):
- `start_of_speech = tokeniser_length + 1`
- `end_of_speech = tokeniser_length + 2`
- `start_of_human = tokeniser_length + 3`
- `end_of_human = tokeniser_length + 4`
- `start_of_ai = tokeniser_length + 5`
- `end_of_ai = tokeniser_length + 6`
- `pad_token = tokeniser_length + 7`
- `audio_tokens_start = tokeniser_length + 10`
- `codebook_size = 4032`
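As a worked example of these constants, the sketch below computes the absolute IDs and the token range each codebook occupies, assuming 4 codebooks as described in the token flow pipeline. The helper names `codebook_range` and `split_token` are hypothetical.

```python
TOKENISER_LENGTH = 64400
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4

SPECIAL_OFFSETS = {
    "start_of_speech": 1,
    "end_of_speech": 2,
    "start_of_human": 3,
    "end_of_human": 4,
    "start_of_ai": 5,
    "end_of_ai": 6,
    "pad_token": 7,
    "audio_tokens_start": 10,
}
SPECIAL_TOKENS = {name: TOKENISER_LENGTH + off for name, off in SPECIAL_OFFSETS.items()}
AUDIO_TOKENS_START = SPECIAL_TOKENS["audio_tokens_start"]  # 64410


def codebook_range(index):
    """Inclusive (low, high) token ID range for one codebook."""
    if not 0 <= index < NUM_CODEBOOKS:
        raise ValueError("codebook index out of range")
    low = AUDIO_TOKENS_START + CODEBOOK_SIZE * index
    return low, low + CODEBOOK_SIZE - 1


def split_token(token_id):
    """Map an audio token ID to (codebook_index, code)."""
    rel = token_id - AUDIO_TOKENS_START
    if not 0 <= rel < CODEBOOK_SIZE * NUM_CODEBOOKS:
        raise ValueError("not an audio token")
    return divmod(rel, CODEBOOK_SIZE)
```

Under these values, codebook 0 spans 64410 to 68441 and codebook 3 spans 76506 to 80537, so audio tokens occupy IDs 64410 through 80537.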
## Multi-Speaker Support

Models with `speaker_id` mappings in `model_config.yaml` support voice selection:
- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
- The Gradio UI shows/hides speaker dropdown based on selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control
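The speaker prefix is a plain string operation on the prompt text. A minimal sketch, assuming the `"<speaker>: <text>"` format shown in the example above; `apply_speaker` is a hypothetical name.

```python
def apply_speaker(text, speaker_id=None):
    """Prefix the prompt with a speaker ID; return the text unchanged if none is given."""
    if not speaker_id:
        return text  # base models: no speaker control
    return f"{speaker_id}: {text}"
```

For example, `apply_speaker("Hello", "andrew")` returns `"andrew: Hello"`, while a base model call with no speaker passes the text through as-is.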
## HuggingFace Spaces Deployment

The README.md header contains HF Spaces metadata:
- `sdk: gradio` with version 5.46.0
- `app_file: app.py` as entrypoint
- References 3 model checkpoints and the NeMo codec