# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

KaniTTS is a Text-to-Speech system in which a causal language model generates NeMo audio codec tokens that are then decoded into a waveform. The project is deployed as a HuggingFace Gradio Space.

## Running the Application

```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```

The app requires a HuggingFace token set as the `HF_TOKEN` environment variable to download models.

## Architecture

### Token Flow Pipeline

The system uses a custom token layout that interleaves text and audio in a single sequence:

1. **Input prompt construction** (`KaniModel.get_input_ids`):
   - `START_OF_HUMAN` → text tokens → `END_OF_TEXT` → `END_OF_HUMAN`
   - Optionally prefixed with speaker ID (e.g., "andrew: Hello world")

2. **LLM generation** (`KaniModel.model_request`):
   - Model generates sequence containing: text section + `START_OF_SPEECH` + audio codec tokens + `END_OF_SPEECH`

3. **Audio decoding** (`NemoAudioPlayer.get_waveform`):
   - Extracts audio tokens between `START_OF_SPEECH` and `END_OF_SPEECH`
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Token values carry an offset of `audio_tokens_start + (codebook_size * codebook_index)`, which is removed to recover per-codebook indices (see the sketch after this list)
   - NeMo codec reconstructs waveform from the 4 codebooks
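
To make step 3 concrete, here is a minimal sketch of the extraction and de-offsetting, assuming the speech markers are present in the output. It is illustrative rather than the repo's code: `extract_codes` and the exact slicing are hypothetical, while the constants come from the "Important Token Constants" section below.

```python
# Illustrative sketch of the decode step, not the repo's exact implementation.
import numpy as np

TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4  # q = 4 interleaved codebooks

def extract_codes(generated_ids: list[int]) -> np.ndarray:
    """Slice out the audio span and undo the per-codebook offset."""
    start = generated_ids.index(START_OF_SPEECH) + 1
    end = generated_ids.index(END_OF_SPEECH)
    audio = generated_ids[start:end]
    audio = audio[: len(audio) - len(audio) % NUM_CODEBOOKS]  # keep whole frames

    # Interleaved layout: [c0, c1, c2, c3, c0, c1, c2, c3, ...]
    frames = np.asarray(audio).reshape(-1, NUM_CODEBOOKS)
    for q in range(NUM_CODEBOOKS):
        frames[:, q] -= AUDIO_TOKENS_START + CODEBOOK_SIZE * q

    return frames.T  # shape (4, num_frames), ready for the NeMo codec decoder
```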

### Key Classes

**`NemoAudioPlayer`** (util.py:27-170)
- Loads NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from `tokeniser_length` base)
- Validates output has required speech markers
- Extracts and decodes 4-codebook audio tokens from LLM output
- Returns 22050 Hz audio as NumPy array

**`KaniModel`** (util.py:172-303)
- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens
- Runs generation with sampling parameters (temp, top_p, repetition_penalty)
- Delegates audio reconstruction to `NemoAudioPlayer`
- Returns tuple: (audio_array, text, timing_report)
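
As a rough sketch, the sampling parameters above map onto the standard HuggingFace `generate` call along these lines; the wrapper name and exact kwargs are assumptions, not the code in util.py.

```python
def generate_speech_tokens(model, input_ids, temperature, top_p,
                           repetition_penalty, max_new_tokens):
    """Hypothetical wrapper showing how the UI sampling knobs reach generate();
    max_new_tokens corresponds to the UI's max_len."""
    return model.generate(
        input_ids,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        repetition_penalty=repetition_penalty,
        max_new_tokens=max_new_tokens,
    )
```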

**`InitModels`** (util.py:305-343)
- Factory that loads all models from `model_config.yaml` at startup
- Returns dict mapping model names to `KaniModel` instances
- All models share the same `NemoAudioPlayer` instance
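
A minimal sketch of that startup factory, assuming the top-level YAML keys named under "Configuration Files" below; the constructor signatures are assumptions, and the real code lives in util.py.

```python
# Hypothetical sketch of the startup factory; real signatures may differ.
import yaml
from util import KaniModel, NemoAudioPlayer  # repo-local classes

def load_models(config_path: str = "model_config.yaml") -> dict:
    with open(config_path) as f:
        config = yaml.safe_load(f)

    player = NemoAudioPlayer(config["nemo_player"])   # shared codec wrapper
    return {
        name: KaniModel(cfg, player)                  # one causal LM per entry
        for name, cfg in config["models"].items()
    }
```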

**`Examples`** (util.py:345-387)
- Converts `examples.yaml` structure into Gradio Examples format
- Output order: `[text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]`
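
A sketch of that conversion for a single entry; only the output order is documented, so the YAML field names used here are assumptions.

```python
# Sketch: flatten one examples.yaml entry into the documented Gradio row order.
def to_example_row(entry: dict) -> list:
    return [
        entry.get("text"),
        entry.get("model"),
        entry.get("speaker_id"),
        entry.get("temperature"),
        entry.get("top_p"),
        entry.get("repetition_penalty"),
        entry.get("max_len"),
    ]
```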

### Configuration Files

**`model_config.yaml`**
- `nemo_player`: NeMo codec config (model name, token layout constants)
- `models`: Dict of available TTS models with device_map and optional speaker_id mappings

**`examples.yaml`**
- List of example prompts with associated parameters for Gradio UI

### Dependency Setup

`create_env.py` runs before imports in `app.py` to:
- Install transformers from git main branch (required for compatibility)
- Set `OMP_NUM_THREADS=4`
- Use a `/tmp/deps_installed` marker to avoid reinstalling on every run
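
The run-once guard follows a common marker-file pattern, roughly like this sketch (the actual commands in `create_env.py` may differ):

```python
# Sketch of a run-once install guard; details in create_env.py may differ.
import os
import subprocess
import sys

MARKER = "/tmp/deps_installed"

if not os.path.exists(MARKER):
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "git+https://github.com/huggingface/transformers",  # git main branch
    ])
    open(MARKER, "w").close()  # later runs see the marker and skip the install

os.environ["OMP_NUM_THREADS"] = "4"
```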

## Important Token Constants

All special tokens are defined relative to `tokeniser_length` (64400):
- `start_of_speech = tokeniser_length + 1`
- `end_of_speech = tokeniser_length + 2`
- `start_of_human = tokeniser_length + 3`
- `end_of_human = tokeniser_length + 4`
- `start_of_ai = tokeniser_length + 5`
- `end_of_ai = tokeniser_length + 6`
- `pad_token = tokeniser_length + 7`
- `audio_tokens_start = tokeniser_length + 10`
- `codebook_size = 4032`
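
Combining these constants with the offset rule from the pipeline section gives the mapping between raw codec indices and LLM token IDs; the helper names below are illustrative.

```python
# Illustrative helpers derived from the constants above; names are hypothetical.
TOKENISER_LENGTH = 64400
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10  # 64410
CODEBOOK_SIZE = 4032

def to_llm_token(code: int, q: int) -> int:
    """Embed raw codec index `code` from codebook `q` (0..3) into the LLM vocab."""
    return AUDIO_TOKENS_START + CODEBOOK_SIZE * q + code

def to_codec_index(token_id: int, q: int) -> int:
    """Invert the mapping when decoding LLM output back to codec indices."""
    return token_id - AUDIO_TOKENS_START - CODEBOOK_SIZE * q
```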

## Multi-Speaker Support

Models with `speaker_id` mappings in `model_config.yaml` support voice selection:
- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
- The Gradio UI shows/hides speaker dropdown based on selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control
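
The prefixing itself is a plain string operation; a sketch of the idea (the real logic lives in `KaniModel.get_input_ids`):

```python
# Sketch: speaker conditioning is just a text prefix applied before tokenization.
def with_speaker(text: str, speaker_id: str | None) -> str:
    return f"{speaker_id}: {text}" if speaker_id else text

# with_speaker("Hello world", "andrew") -> "andrew: Hello world"
```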

## HuggingFace Spaces Deployment

The README.md header contains HF Spaces metadata:
- `sdk: gradio` with version 5.46.0
- `app_file: app.py` as entrypoint
- References 3 model checkpoints and the NeMo codec