Den Pavloff

fix token conflict

8a1b058 3 days ago

	# CLAUDE.md

	This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

	## Project Overview

	KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.

	## Running the Application

	```bash
	# Run the Gradio app (launches on http://0.0.0.0:7860)
	python app.py
	```

	The app requires a HuggingFace token set as the `HF_TOKEN` environment variable to download models.
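
A fail-fast check can make a missing token obvious before any model download starts. The helper `require_hf_token` below is hypothetical and not part of the repository; it only reads the `HF_TOKEN` variable named above.

```python
import os


def require_hf_token(env=os.environ):
    """Return the HF_TOKEN value, or raise a clear error if it is unset or empty."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running app.py")
    return token
```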

	## Architecture

	### Token Flow Pipeline

	The system uses a custom token layout that places text and audio tokens in a single sequence:

	1. Input prompt construction (`KaniModel.get_input_ids`):
	   - `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
	   - Optionally prefixed with speaker ID (e.g., "andrew: Hello world")

	2. LLM generation (`KaniModel.model_request`):
	   - Model generates sequence containing: text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`

	3. Audio decoding (`NemoAudioPlayer.get_waveform`):
	   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
	   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
	   - Tokens are offset by `audio_tokens_start + (codebook_size * codebook_index)`
	   - NeMo codec reconstructs waveform from the 4 codebooks
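
The decoding step above can be sketched in plain Python. This is a minimal illustration, not the repository's `NemoAudioPlayer.get_waveform`: the function name `extract_codes` is hypothetical, and it assumes frame-major interleaving (position `i` in the audio section belongs to codebook `i % 4`) and that a trailing partial frame is dropped. The constants are taken from the token constants listed in this file.

```python
TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4


def extract_codes(output_ids):
    """Return one list of raw codec indices per codebook (hypothetical helper)."""
    if START_OF_SPEECH not in output_ids or END_OF_SPEECH not in output_ids:
        raise ValueError("output is missing speech markers")

    # Keep only the tokens strictly between the two speech markers.
    start = output_ids.index(START_OF_SPEECH) + 1
    end = output_ids.index(END_OF_SPEECH, start)
    audio = output_ids[start:end]

    # Assumption: an incomplete final frame is discarded.
    usable = len(audio) - len(audio) % NUM_CODEBOOKS

    codebooks = [[] for _ in range(NUM_CODEBOOKS)]
    for pos in range(usable):
        q = pos % NUM_CODEBOOKS  # assumed frame-major interleaving
        # Undo the per-codebook offset to recover the raw codec index.
        code = audio[pos] - AUDIO_TOKENS_START - CODEBOOK_SIZE * q
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"token at position {pos} is outside codebook {q}")
        codebooks[q].append(code)
    return codebooks
```

The resulting four lists are what a codec decoder would consume; the actual NeMo call is left out here because its exact usage in `util.py` is not shown in this file.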

	### Key Classes

	`NemoAudioPlayer` (util.py:27-170)
	- Loads NeMo AudioCodecModel for waveform reconstruction
	- Manages special token IDs (derived from `tokeniser_length` base)
	- Validates output has required speech markers
	- Extracts and decodes 4-codebook audio tokens from LLM output
	- Returns 22050 Hz audio as NumPy array

	`KaniModel` (util.py:172-303)
	- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
	- Prepares prompts with conversation/modality control tokens
	- Runs generation with sampling parameters (temp, top_p, repetition_penalty)
	- Delegates audio reconstruction to `NemoAudioPlayer`
	- Returns tuple: (audio_array, text, timing_report)

	`InitModels` (util.py:305-343)
	- Factory that loads all models from `model_config.yaml` at startup
	- Returns dict mapping model names to `KaniModel` instances
	- All models share the same `NemoAudioPlayer` instance
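
The shared-player pattern described for `InitModels` can be sketched as follows. `Player` and `Model` are stand-ins for `NemoAudioPlayer` and `KaniModel`, and the `codec_name` key is a hypothetical placeholder; only the top-level `nemo_player` and `models` keys come from this file.

```python
from dataclasses import dataclass


@dataclass
class Player:
    """Stand-in for NemoAudioPlayer (hypothetical, holds no real codec)."""
    codec_name: str


@dataclass
class Model:
    """Stand-in for KaniModel (hypothetical, holds no real LM)."""
    name: str
    player: Player


def init_models(config):
    """Build one Player and hand the same instance to every Model."""
    player = Player(config["nemo_player"]["codec_name"])
    return {name: Model(name, player) for name in config["models"]}
```

Sharing a single player means the codec is loaded once, however many TTS models are configured.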

	`Examples` (util.py:345-387)
	- Converts `examples.yaml` structure into Gradio Examples format
	- Output order: `[text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]`
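
A conversion of this kind might look like the sketch below. It assumes `examples.yaml` has already been parsed into a list of dicts whose keys match the field names in the output order above; the real key names and the `to_gradio_rows` helper are assumptions.

```python
FIELD_ORDER = [
    "text",
    "model",
    "speaker_id",
    "temperature",
    "top_p",
    "repetition_penalty",
    "max_len",
]


def to_gradio_rows(examples):
    """Turn a list of example dicts into rows ordered like FIELD_ORDER.

    Missing keys (for instance speaker_id on a base model) become None.
    """
    return [[example.get(field) for field in FIELD_ORDER] for example in examples]
```

Each row can then be passed to `gr.Examples(examples=rows, inputs=[...])`, with the input components listed in the same order.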

	### Configuration Files

	`model_config.yaml`
	- `nemo_player`: NeMo codec config (model name, token layout constants)
	- `models`: Dict of available TTS models with `device_map` and optional `speaker_id` mappings

	`examples.yaml`
	- List of example prompts with associated parameters for Gradio UI

	### Dependency Setup

	`create_env.py` runs before the imports in `app.py` and:
	- Installs transformers from the git main branch (required for compatibility)
	- Sets `OMP_NUM_THREADS=4`
	- Uses a `/tmp/deps_installed` marker file to avoid reinstalling on every run
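
The marker-file guard can be sketched as below. This is an illustration of the pattern, not the contents of `create_env.py`: the function name `ensure_installed` is hypothetical, the install step is passed in as a callable so the sketch stays side-effect free, and setting `OMP_NUM_THREADS` outside the guard is an assumption.

```python
import os
from pathlib import Path


def ensure_installed(marker, install):
    """Run install() once; skip it on later calls while the marker file exists.

    Returns True if install() ran, False if it was skipped.
    """
    # Assumption: the thread limit is applied on every run, not only on install.
    os.environ["OMP_NUM_THREADS"] = "4"

    marker = Path(marker)
    if marker.exists():
        return False

    install()
    # Only mark success after install() returns without raising.
    marker.touch()
    return True
```

In a real script, `install` would likely wrap a `pip install` of transformers from its git repository via `subprocess.check_call`, and `marker` would be `/tmp/deps_installed`. Because `/tmp` is typically cleared when a container restarts, the install would run again after a fresh start.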

	## Important Token Constants

	All special tokens are defined relative to `tokeniser_length` (64400); `codebook_size` is a fixed constant:
	- `start_of_speech = tokeniser_length + 1`
	- `end_of_speech = tokeniser_length + 2`
	- `start_of_human = tokeniser_length + 3`
	- `end_of_human = tokeniser_length + 4`
	- `start_of_ai = tokeniser_length + 5`
	- `end_of_ai = tokeniser_length + 6`
	- `pad_token = tokeniser_length + 7`
	- `audio_tokens_start = tokeniser_length + 10`
	- `codebook_size = 4032`
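
To make the layout concrete, the sketch below computes the absolute IDs from the constants above and derives the ID range each codebook occupies, using the offset formula `audio_tokens_start + (codebook_size * codebook_index)` with 4 codebooks. The function name `token_layout` is hypothetical.

```python
def token_layout(tokeniser_length=64400, codebook_size=4032, num_codebooks=4):
    """Return absolute special-token IDs plus the inclusive ID range per codebook."""
    layout = {
        "start_of_speech": tokeniser_length + 1,
        "end_of_speech": tokeniser_length + 2,
        "start_of_human": tokeniser_length + 3,
        "end_of_human": tokeniser_length + 4,
        "start_of_ai": tokeniser_length + 5,
        "end_of_ai": tokeniser_length + 6,
        "pad_token": tokeniser_length + 7,
        "audio_tokens_start": tokeniser_length + 10,
    }
    start = layout["audio_tokens_start"]
    # Codebook q owns codebook_size consecutive IDs starting at its offset.
    layout["codebook_ranges"] = [
        (start + q * codebook_size, start + (q + 1) * codebook_size - 1)
        for q in range(num_codebooks)
    ]
    return layout
```

With the default values, audio token IDs span 64410 through 80537, so the model's vocabulary must cover at least that range.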

	## Multi-Speaker Support

	Models with `speaker_id` mappings in `model_config.yaml` support voice selection:
	- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
	- The Gradio UI shows/hides speaker dropdown based on selected model
	- Base models (v.0.1, v.0.2) generate random voices without speaker control
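
The prefixing rule is simple enough to show directly. `build_prompt` is a hypothetical helper, not a function from `util.py`; it follows the "andrew: Hello" format given above and leaves the text untouched when no speaker is selected.

```python
def build_prompt(text, speaker_id=None):
    """Prefix the text with the speaker ID when one is given."""
    if speaker_id:
        return f"{speaker_id}: {text}"
    # Base models without speaker control receive the plain text.
    return text
```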

	## HuggingFace Spaces Deployment

	The README.md header contains HF Spaces metadata:
	- `sdk: gradio` with version 5.46.0
	- `app_file: app.py` as entrypoint
	- References 3 model checkpoints and the NeMo codec