---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---

# CSM-1B-HF

Sesame CSM 1B model weights for my Hugging Face implementation.


## Overview

CSM-HF is a Hugging Face implementation of Sesame's Conversational Speech Model (CSM) and a complete rewrite of the PyTorch code released by Sesame. The codebase is designed to be fully compatible with Hugging Face Transformers, from inference to training.

## Changes from Sesame's implementation

- Created a `CSMModel` class
- Replaced the torchtune backbone and decoder models with Hugging Face Transformers `LlamaModel`
- Added a processor class to prepare inputs for the model
- Added label support and decoder training amortization
- Added `generate_frame` and `generate` methods to the model class for generating audio
- Full support for the Hugging Face `Trainer`

## Generation

You can use the model to generate audio from text input. Here's an example for voice cloning:

```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from moshi.models import loaders
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

from modeling_csm import CSMModel
from processor import CSMProcessor

device = "cuda"


def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )
    return tokenizer


def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)
    audio = audio.squeeze(0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio


text_tokenizer = load_llama3_tokenizer()

mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)

model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)

inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "<AUDIO_CLIP_TRANSCRIPT>"},
                # This placeholder is required for audio tokenization
                # (it maps to the first element of the `audios` list passed to the processor)
                {"type": "audio"},
            ],
        },
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
                # No audio here: the model will generate it
            ],
        },
    ],
    audios=[load_audio("AUDIO_CLIP_FOR_VOICE_CLONING.wav", audio_tokenizer.sample_rate)],
    return_tensors="pt",
)

with torch.inference_mode():
    # Generate up to 50 new frames
    gen_frames = model.generate(
        input_ids=inputs["input_ids"].to(device),
        attention_mask=inputs["attention_mask"].to(device),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)

audio_array = (decoded_audio * 32768).to(torch.int16).cpu().numpy()

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
```
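To keep the generated clip, the `int16` samples can also be written to disk with Python's standard-library `wave` module. A minimal sketch, using a tiny dummy buffer in place of the real `audio_array` and assuming a 24 kHz sample rate (use `audio_tokenizer.sample_rate` in practice):

```python
import wave
from array import array


def save_wav(samples_int16, sample_rate, path):
    """Write mono int16 PCM samples (anything exposing .tobytes()) to a WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(samples_int16.tobytes())


# Dummy buffer standing in for `audio_array`; a numpy int16 array works the same way
dummy = array("h", [0, 1000, -1000, 0])
save_wav(dummy, 24000, "out.wav")
```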

## Architecture

The model architecture is discussed in ARCHITECTURE.md (written by O1).

## Training

### Data Format

CSM-HF expects training data in JSONL format, where each line is a JSON object containing one conversation. Each conversation consists of:

- `messages`: an array of message objects, each with:
  - `role`: speaker identifier (e.g., `"speaker_0"`, `"speaker_1"`)
  - `content`: an array of content objects, which can be:
    - text: `{"type": "text", "text": "The message text"}`
    - audio: `{"type": "audio", "url": "path/to/audio/file.wav"}`
- `training_mask`: a boolean array indicating which messages are used for training (`true`) and which serve only as context (`false`)

Example data format:

```json
{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
```
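A training file in this format can be assembled with the standard `json` module — one object per line. A sketch (the texts and clip paths below are placeholders):

```python
import json

conversation = {
    "messages": [
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "We have a chance for a new life here."},
                {"type": "audio", "url": "clips/example_audio.wav"},
            ],
        },
        {
            "role": "speaker_1",
            "content": [
                {"type": "text", "text": "Uncle?"},
                {"type": "audio", "url": "clips/response_audio.wav"},
            ],
        },
    ],
    # First message is context only, second is a training target
    "training_mask": [False, True],
}

with open("training_data.jsonl", "w") as f:
    # One JSON object per line; append further conversations the same way
    f.write(json.dumps(conversation) + "\n")
```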

### Training Process

The model uses a two-stage autoregressive architecture:

1. **Backbone (inter-frame processing):**
   - Processes the entire sequence of frames
   - Each frame is a combined embedding of all codebooks
   - Handles long-range dependencies between utterances
2. **Decoder (intra-frame processing):**
   - Processes a single frame at a time
   - Generates 32 codebooks sequentially (1 semantic + 31 acoustic)
   - Each codebook is treated as a token in the sequence
Training leverages compute amortization:

- The zeroth (semantic) codebook is trained on all frames
- The remaining codebooks (1–31) are trained on only a fraction of the frames, controlled by `amortization_ratio`
- This significantly reduces memory usage while maintaining quality
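The frame subsampling can be sketched as below, assuming `amortization_ratio` denotes the fraction of frames whose acoustic codebooks contribute to the decoder loss (check train.py for the exact semantics):

```python
import random


def select_amortized_frames(num_frames, amortization_ratio, seed=0):
    """Pick the subset of frame indices that receive a decoder (codebooks 1-31) loss.

    Codebook 0 is trained on every frame; the remaining codebooks only on this subset.
    """
    rng = random.Random(seed)
    num_selected = max(1, int(num_frames * amortization_ratio))
    return sorted(rng.sample(range(num_frames), num_selected))


# With a ratio of 1/16, only ~6% of frames incur the full 31-codebook decoder compute
selected = select_amortized_frames(num_frames=160, amortization_ratio=1 / 16)
```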

To train the model:

```sh
python train.py \
  --train_file path/to/training_data.jsonl \
  --output_dir ./output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 5e-6
```

## TODO

- Two-stage autoregressive architecture implementation
- Multi-codebook audio tokenization
- Compute amortization for efficient training
- Dataset preparation with interleaved text/audio
- Custom training loop with separate backbone/decoder losses
- Proper handling of epoch repetition for decoder amortization
- Memory optimization techniques (mixed precision, gradient accumulation)
- LoRA support for efficient fine-tuning
- Faster inference with torch.compile
- Voice cloning with prompt tuning / prefix optimization
- Support for DPO
- Support for RL (GRPO, RLOO, etc.)

## Acknowledgements

Special thanks to:

- Sesame Labs for the original architecture design and implementation
- Hugging Face for the Transformers library and training infrastructure
- Claude and ChatGPT for assistance with documentation and code development

This project builds upon research and tools from the open-source community. I am grateful for the collaborative spirit that makes projects like this possible.