# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.

## Running the Application

```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```

The app requires a HuggingFace token set as the `HF_TOKEN` environment variable to download models.

## Architecture

### Token Flow Pipeline

The system uses a custom token layout that interleaves text and audio in a single sequence:

1. **Input prompt construction** (`KaniModel.get_input_ids`):
   - `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
   - Optionally prefixed with a speaker ID (e.g., "andrew: Hello world")
2. **LLM generation** (`KaniModel.model_request`):
   - The model generates a sequence containing: text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`
3. **Audio decoding** (`NemoAudioPlayer.get_waveform`):
   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Tokens are offset by `audio_tokens_start + (codebook_size * codebook_index)`
   - The NeMo codec reconstructs the waveform from the 4 codebooks

### Key Classes

**`NemoAudioPlayer`** (util.py:27-170)
- Loads the NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from the `tokeniser_length` base)
- Validates that the output has the required speech markers
- Extracts and decodes 4-codebook audio tokens from the LLM output
- Returns 22050 Hz audio as a NumPy array

**`KaniModel`** (util.py:172-303)
- Wraps a HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens
- Runs generation with sampling parameters (temperature, top_p, repetition_penalty)
- Delegates audio reconstruction to `NemoAudioPlayer`
- Returns a tuple: (audio_array, text, timing_report)

**`InitModels`** (util.py:305-343)
- Factory that loads all models from `model_config.yaml` at startup
- Returns a dict mapping model names to `KaniModel` instances
- All models share the same `NemoAudioPlayer` instance

**`Examples`** (util.py:345-387)
- Converts the `examples.yaml` structure into Gradio Examples format
- Output order: `[text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]`

### Configuration Files

**`model_config.yaml`**
- `nemo_player`: NeMo codec config (model name, token layout constants)
- `models`: Dict of available TTS models with device_map and optional speaker_id mappings

**`examples.yaml`**
- List of example prompts with associated parameters for the Gradio UI

### Dependency Setup

`create_env.py` runs before imports in `app.py` to:
- Install transformers from the git main branch (required for compatibility)
- Set `OMP_NUM_THREADS=4`
- Use a `/tmp/deps_installed` marker to avoid reinstalling on every run

## Important Token Constants

All special tokens are defined relative to `tokeniser_length` (64400):
- `start_of_speech = tokeniser_length + 1`
- `end_of_speech = tokeniser_length + 2`
- `start_of_human = tokeniser_length + 3`
- `end_of_human = tokeniser_length + 4`
- `start_of_ai = tokeniser_length + 5`
- `end_of_ai = tokeniser_length + 6`
- `pad_token = tokeniser_length + 7`
- `audio_tokens_start = tokeniser_length + 10`
- `codebook_size = 4032`
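The following is a minimal sketch of how these constants map generated audio tokens back to codec indices. It assumes round-robin interleaving of the 4 codebooks as described in the token flow pipeline; the constant and function names are illustrative, not the actual `util.py` API.

```python
# Illustrative only -- not the util.py implementation; assumes round-robin
# interleaving of the 4 codebooks as described above.
TOKENISER_LENGTH = 64400

START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
START_OF_HUMAN = TOKENISER_LENGTH + 3
END_OF_HUMAN = TOKENISER_LENGTH + 4
START_OF_AI = TOKENISER_LENGTH + 5
END_OF_AI = TOKENISER_LENGTH + 6
PAD_TOKEN = TOKENISER_LENGTH + 7
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10

CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4  # q=4


def split_codebooks(audio_tokens: list[int]) -> list[list[int]]:
    """De-interleave the flat audio-token stream into 4 codebooks and
    remove the per-codebook offset so the codec sees raw indices."""
    codebooks = [[] for _ in range(NUM_CODEBOOKS)]
    for position, token in enumerate(audio_tokens):
        q = position % NUM_CODEBOOKS  # codebook index for this slot
        codebooks[q].append(token - (AUDIO_TOKENS_START + CODEBOOK_SIZE * q))
    return codebooks
```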
## Multi-Speaker Support

Models with `speaker_id` mappings in `model_config.yaml` support voice selection:

- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello"); see the sketch at the end of this file
- The Gradio UI shows/hides the speaker dropdown based on the selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control

## HuggingFace Spaces Deployment

The README.md header contains HF Spaces metadata:

- `sdk: gradio` with version 5.46.0
- `app_file: app.py` as the entrypoint
- References 3 model checkpoints and the NeMo codec
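For reference, a minimal sketch of speaker-prefixed prompt construction as described in the token flow and multi-speaker sections above. It is not the actual `KaniModel.get_input_ids` implementation: the function name is illustrative, and the end-of-text token is assumed (not confirmed) to be the model tokenizer's EOS id.

```python
# Illustrative only -- not KaniModel.get_input_ids. Constant values follow the
# token table above; the end-of-text token is assumed to be the tokenizer's EOS.
TOKENISER_LENGTH = 64400
START_OF_HUMAN = TOKENISER_LENGTH + 3
END_OF_HUMAN = TOKENISER_LENGTH + 4


def build_input_ids(tokenizer, text: str, speaker_id: str | None = None) -> list[int]:
    """Wrap an (optionally speaker-prefixed) prompt in conversation control tokens."""
    prompt = f"{speaker_id}: {text}" if speaker_id else text          # e.g. "andrew: Hello"
    text_ids = tokenizer(prompt, add_special_tokens=False).input_ids  # plain text tokens
    end_of_text = tokenizer.eos_token_id                              # assumption, see lead-in
    return [START_OF_HUMAN, *text_ids, end_of_text, END_OF_HUMAN]
```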