
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

KaniTTS is a Text-to-Speech system that uses causal language models to generate speech via NeMo audio codec tokens. The project is deployed as a HuggingFace Gradio Space.

## Running the Application

```bash
# Run the Gradio app (launches on http://0.0.0.0:7860)
python app.py
```

The app requires a HuggingFace token set as the HF_TOKEN environment variable to download models.
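
For reference, this is roughly how a gated checkpoint download consumes that token; the model id below is a placeholder and the actual loading code in util.py may differ:

```python
import os
from transformers import AutoModelForCausalLM

# Sketch only: the model id is a placeholder, not one of the real checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-kani-checkpoint",
    token=os.environ["HF_TOKEN"],  # HF_TOKEN must be set in the environment
)
```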

## Architecture

### Token Flow Pipeline

The system uses a custom token layout that interleaves text and audio in a single sequence:

1. Input prompt construction (KaniModel.get_input_ids):
   - START_OF_HUMAN → text tokens → END_OF_TEXT → END_OF_HUMAN
   - Optionally prefixed with a speaker ID (e.g., "andrew: Hello world")
2. LLM generation (KaniModel.model_request):
   - Model generates a sequence containing: text section + START_OF_SPEECH + audio codec tokens + END_OF_SPEECH
3. Audio decoding (NemoAudioPlayer.get_waveform):
   - Extracts audio tokens between START_OF_SPEECH and END_OF_SPEECH
   - Audio tokens are arranged in 4 interleaved codebooks (q=4)
   - Tokens are offset by audio_tokens_start + (codebook_size * codebook_index); see the decoding sketch below
   - NeMo codec reconstructs the waveform from the 4 codebooks
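
For orientation, here is a minimal sketch of the decoding step, using the values from Important Token Constants below; the function name and tensor handling are illustrative, not the actual NemoAudioPlayer.get_waveform code:

```python
import numpy as np

# Illustrative constants (see "Important Token Constants" below).
TOKENISER_LENGTH = 64400
START_OF_SPEECH = TOKENISER_LENGTH + 1
END_OF_SPEECH = TOKENISER_LENGTH + 2
AUDIO_TOKENS_START = TOKENISER_LENGTH + 10
CODEBOOK_SIZE = 4032
NUM_CODEBOOKS = 4  # q=4

def extract_codebooks(output_ids: list[int]) -> np.ndarray:
    """Pull the audio span out of the LLM output and undo the per-codebook offsets."""
    start = output_ids.index(START_OF_SPEECH) + 1
    end = output_ids.index(END_OF_SPEECH)
    audio = np.array(output_ids[start:end])

    # Keep only whole frames of 4 interleaved codebook tokens.
    audio = audio[: len(audio) - len(audio) % NUM_CODEBOOKS]
    frames = audio.reshape(-1, NUM_CODEBOOKS)  # (num_frames, 4)

    # Token for codebook q was shifted by AUDIO_TOKENS_START + CODEBOOK_SIZE * q.
    offsets = AUDIO_TOKENS_START + CODEBOOK_SIZE * np.arange(NUM_CODEBOOKS)
    return frames - offsets  # raw codec indices, one column per codebook
```

The resulting (num_frames, 4) array is what the NeMo codec consumes to reconstruct the waveform.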

### Key Classes

#### NemoAudioPlayer (util.py:27-170)

- Loads NeMo AudioCodecModel for waveform reconstruction
- Manages special token IDs (derived from tokeniser_length base)
- Validates output has required speech markers
- Extracts and decodes 4-codebook audio tokens from LLM output
- Returns 22050 Hz audio as NumPy array

#### KaniModel (util.py:172-303)

- Wraps HuggingFace causal LM (loaded with bfloat16, auto device mapping)
- Prepares prompts with conversation/modality control tokens (see the prompt sketch below)
- Runs generation with sampling parameters (temperature, top_p, repetition_penalty)
- Delegates audio reconstruction to NemoAudioPlayer
- Returns tuple: (audio_array, text, timing_report)
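
A hedged sketch of the prompt layout from the Token Flow Pipeline; the function name, tokenizer handling, and the use of the tokenizer's EOS token as END_OF_TEXT are assumptions, not the actual KaniModel.get_input_ids implementation:

```python
import torch

TOKENISER_LENGTH = 64400
START_OF_HUMAN = TOKENISER_LENGTH + 3
END_OF_HUMAN = TOKENISER_LENGTH + 4

def build_input_ids(tokenizer, text: str, speaker_id: str | None = None) -> torch.Tensor:
    """Sketch: START_OF_HUMAN -> text tokens -> END_OF_TEXT -> END_OF_HUMAN."""
    if speaker_id:
        text = f"{speaker_id}: {text}"  # e.g. "andrew: Hello world"
    text_ids = tokenizer(text, add_special_tokens=False).input_ids
    # END_OF_TEXT's ID is not listed in this file; using EOS here is an assumption.
    ids = [START_OF_HUMAN, *text_ids, tokenizer.eos_token_id, END_OF_HUMAN]
    return torch.tensor([ids])  # shape (1, seq_len), ready for model.generate
```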

#### InitModels (util.py:305-343)

- Factory that loads all models from model_config.yaml at startup (see the factory sketch below)
- Returns dict mapping model names to KaniModel instances
- All models share the same NemoAudioPlayer instance
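
A minimal sketch of what the factory does, assuming the constructor signatures and import path shown here (the real util.py may differ):

```python
import yaml

from util import KaniModel, NemoAudioPlayer  # assumed import path

def init_models(config_path: str = "model_config.yaml") -> dict:
    """Sketch: one shared codec player, one KaniModel per configured model."""
    with open(config_path) as f:
        config = yaml.safe_load(f)

    player = NemoAudioPlayer(config["nemo_player"])  # shared by all models
    return {
        name: KaniModel(model_cfg, player)           # hypothetical signature
        for name, model_cfg in config["models"].items()
    }
```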

#### Examples (util.py:345-387)

- Converts examples.yaml structure into Gradio Examples format (see the sketch below)
- Output order: [text, model, speaker_id, temperature, top_p, repetition_penalty, max_len]
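
Illustrative only; the per-example keys are assumptions, but the column order matches the description above:

```python
import yaml

def load_examples(path: str = "examples.yaml") -> list[list]:
    """Sketch: flatten example entries into rows in the order Gradio expects."""
    with open(path) as f:
        entries = yaml.safe_load(f)
    return [
        [e["text"], e["model"], e.get("speaker_id"),
         e["temperature"], e["top_p"], e["repetition_penalty"], e["max_len"]]
        for e in entries
    ]
```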

## Configuration Files

### model_config.yaml

- nemo_player: NeMo codec config (model name, token layout constants)
- models: Dict of available TTS models with device_map and optional speaker_id mappings

### examples.yaml

- List of example prompts with associated parameters for the Gradio UI

## Dependency Setup

create_env.py runs before the other imports in app.py to:

- Install transformers from the git main branch (required for compatibility)
- Set OMP_NUM_THREADS=4
- Use a /tmp/deps_installed marker to avoid reinstalling on every run (a sketch of this pattern follows)
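
A minimal sketch of that pattern, assuming details (exact pip arguments, marker handling) that the real create_env.py may implement differently:

```python
import os
import subprocess
import sys
from pathlib import Path

MARKER = Path("/tmp/deps_installed")

def ensure_deps() -> None:
    """Sketch: install git-main transformers once, then rely on the marker file."""
    os.environ["OMP_NUM_THREADS"] = "4"
    if MARKER.exists():
        return
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "git+https://github.com/huggingface/transformers.git",
    ])
    MARKER.touch()
```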

## Important Token Constants

All special tokens are defined relative to tokeniser_length (64400):

- start_of_speech = tokeniser_length + 1
- end_of_speech = tokeniser_length + 2
- start_of_human = tokeniser_length + 3
- end_of_human = tokeniser_length + 4
- start_of_ai = tokeniser_length + 5
- end_of_ai = tokeniser_length + 6
- pad_token = tokeniser_length + 7
- audio_tokens_start = tokeniser_length + 10
- codebook_size = 4032
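
As a quick sanity check, the concrete IDs and codebook ranges implied by these definitions:

```python
tokeniser_length = 64400
audio_tokens_start = tokeniser_length + 10  # 64410
codebook_size = 4032

assert tokeniser_length + 1 == 64401  # start_of_speech
assert tokeniser_length + 2 == 64402  # end_of_speech

# Codebook q (0..3) occupies [audio_tokens_start + codebook_size*q,
#                             audio_tokens_start + codebook_size*(q+1)).
for q in range(4):
    lo = audio_tokens_start + codebook_size * q
    print(q, lo, lo + codebook_size - 1)
# 0 64410 68441
# 1 68442 72473
# 2 72474 76505
# 3 76506 80537
```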

## Multi-Speaker Support

Models with speaker_id mappings in model_config.yaml support voice selection:

- Speaker IDs are prefixed to the text prompt (e.g., "andrew: Hello")
- The Gradio UI shows/hides the speaker dropdown based on the selected model
- Base models (v.0.1, v.0.2) generate random voices without speaker control

## HuggingFace Spaces Deployment

The README.md header contains HF Spaces metadata:

- sdk: gradio with version 5.46.0
- app_file: app.py as entrypoint
- References 3 model checkpoints and the NeMo codec