---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---
|
|
|
# CSM-1B-HF |
|
|
|
Sesame CSM 1B model weights for my [Hugging Face implementation](https://github.com/thomasgauthier/csm-hf/).
|
|
|
--- |
|
|
|
## Overview |
|
|
|
CSM-HF is a Hugging Face implementation of [Sesame's Conversational Speech Model (CSM)](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) and a complete rewrite of the [PyTorch code provided by Sesame](https://github.com/SesameAILabs/csm). The codebase is designed to be fully compatible with Hugging Face `transformers`, from inference to training.
|
|
|
## Changes from Sesame's implementation |
|
|
|
- created a `CSMModel` class |
|
- replaced the backbone and decoder torchtune models with HF transformers `LlamaModel`
|
- added a processor class to prepare inputs for the model |
|
- added labels support and [decoder training amortization](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#:~:text=The%20audio%20decoder%20is%20trained%20on%20only%20a%20random%201/16%20subset%20of%20the%20audio%20frames%2C%20while%20the%20zeroth%20codebook%20is%20trained%20on%20every%20frame.) |
|
- added `generate_frame` and `generate` methods to the model class for generating audio |
|
- full support for the Hugging Face `Trainer`
|
|
|
## Generation |
|
|
|
You can use the model to generate audio from text input. Here's an example for voice cloning: |
|
|
|
```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from moshi.models import loaders
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

from modeling_csm import CSMModel
from processor import CSMProcessor

device = 'cuda'


def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )
    return tokenizer


text_tokenizer = load_llama3_tokenizer()

# Mimi codec from moshi, used as the audio tokenizer (32 codebooks per frame)
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)


def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)
    audio = audio.squeeze(0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio


model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)

inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "<AUDIO_CLIP_TRANSCRIPT>"},
                # This placeholder is required for audio tokenization: it maps to the
                # first element of the `audios` list passed to the processor.
                {"type": "audio"},
            ],
        },
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
                # No audio content here: the model will generate it.
            ],
        },
    ],
    audios=[load_audio('AUDIO_CLIP_FOR_VOICE_CLONING.wav', audio_tokenizer.sample_rate)],
    return_tensors="pt",
)

with torch.inference_mode():
    # Generate up to 50 new audio frames
    gen_frames = model.generate(
        input_ids=inputs['input_ids'].to(device),
        attention_mask=inputs['attention_mask'].to(device),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

# Decode the generated codebook frames back into a waveform
decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)
audio_array = (decoded_audio * 32768).to(torch.int16).cpu().numpy()

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
```
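
To write the generated audio to disk instead of playing it inline, a minimal follow-up using `torchaudio` should work (this reuses `decoded_audio` and `audio_tokenizer` from the example above; `output.wav` is an arbitrary filename):

```python
# Save the generated waveform to a WAV file (continues from the example above).
import torch
import torchaudio

torchaudio.save(
    "output.wav",
    decoded_audio.unsqueeze(0).to(torch.float32).cpu(),  # shape (channels, samples)
    sample_rate=audio_tokenizer.sample_rate,
)
```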
|
|
|
## Architecture |
|
|
|
Model architecture is discussed in [ARCHITECTURE.md](https://github.com/thomasgauthier/csm-hf/blob/main/ARCHITECTURE.md) (written by O1) |
|
|
|
## Training |
|
|
|
### Data Format |
|
|
|
CSM-HF expects training data in JSONL format, where each line is a JSON object containing a conversation. Each conversation consists of:
|
|
|
- `messages`: An array of message objects, each with: |
|
- `role`: Speaker identifier (e.g., "speaker_0", "speaker_1") |
|
- `content`: Array of content objects, which can be: |
|
- Text: `{"type": "text", "text": "The message text"}` |
|
- Audio: `{"type": "audio", "url": "path/to/audio/file.wav"}` |
|
- `training_mask`: Boolean array indicating which messages are trained on (`true`) and which serve only as context (`false`)
|
|
|
Example data format: |
|
|
|
```json
{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
```
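
For reference, this format can be read with nothing more than the standard library. A minimal sketch (the `training_data.jsonl` path is a placeholder; the actual dataset handling lives in the repo's training code):

```python
# Minimal sketch of iterating over the JSONL training format (not the repo's dataset code).
import json

with open("training_data.jsonl") as f:
    for line in f:
        conversation = json.loads(line)
        messages = conversation["messages"]
        training_mask = conversation["training_mask"]
        # One boolean per message: True -> trained on, False -> context only
        assert len(training_mask) == len(messages)
```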
|
|
|
### Training Process |
|
|
|
The model uses a two-stage autoregressive architecture: |
|
|
|
1. **Backbone (Inter-frame Processing)**: |
|
- Processes the entire sequence of frames |
|
- Each frame represents a combined embedding of all codebooks |
|
- Handles long-range dependencies between utterances |
|
|
|
2. **Decoder (Intra-frame Processing)**: |
|
- Processes a single frame at a time |
|
- Generates 32 codebooks sequentially (1 semantic + 31 acoustic) |
|
- Each codebook is treated as a token in the sequence (see the toy shape sketch below)
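
To make the two views concrete, here is a toy shape sketch; the tensor layout and vocabulary size are assumptions for exposition, not the exact `CSMModel` API:

```python
# Toy illustration of the backbone vs. decoder sequence views (shapes chosen for exposition only).
import torch

batch, num_frames, num_codebooks = 1, 120, 32  # 1 semantic + 31 acoustic codebooks per frame

# Audio is tokenized into one code per codebook per frame (2048 is an assumed codebook size)
audio_codes = torch.randint(0, 2048, (batch, num_frames, num_codebooks))

# Backbone view: the sequence axis is the frames; each frame becomes one combined embedding
print(audio_codes.shape)      # torch.Size([1, 120, 32])

# Decoder view: within a single frame, the 32 codebooks form their own short sequence
frame = audio_codes[:, 0, :]  # torch.Size([1, 32]), generated one codebook token at a time
print(frame.shape)
```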
|
|
|
Training leverages compute amortization techniques: |
|
- The zeroth (semantic) codebook is trained on all frames |
|
- The remaining codebooks (1-31) are trained on only `amortization_ratio` of the frames (see the sketch after this list)
|
- This significantly reduces memory usage while maintaining quality |
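
A rough sketch of what this amortization can look like (illustrative only; the convention that `amortization_ratio = 16` means roughly 1 decoder-trained frame out of every 16 is an assumption, not the repo's exact implementation):

```python
# Illustrative frame sub-sampling for decoder amortization (not the repo's actual code).
import torch

num_frames = 256
amortization_ratio = 16  # assumed convention: decoder loss on ~1 of every 16 frames

# Codebook 0 (semantic): loss is computed on every frame position
zeroth_codebook_positions = torch.arange(num_frames)

# Codebooks 1-31 (acoustic): loss is computed on a random subset of frame positions only
num_decoder_frames = max(1, num_frames // amortization_ratio)
decoder_positions = torch.randperm(num_frames)[:num_decoder_frames]
```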
|
|
|
To train the model: |
|
|
|
```bash
python train.py \
    --train_file path/to/training_data.jsonl \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-6
```
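
Because the codebase targets the Hugging Face `Trainer`, the flags above map onto standard `TrainingArguments`. A hedged sketch of the equivalent configuration (the dataset and collator wiring is handled by `train.py` and omitted here):

```python
# TrainingArguments implied by the CLI flags above (dataset/collator setup omitted).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
)
```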
|
|
|
|
|
## TODO |
|
|
|
- [x] Two-stage autoregressive architecture implementation |
|
- [x] Multi-codebook audio tokenization |
|
- [x] Compute amortization for efficient training |
|
- [x] Dataset preparation with interleaved text/audio |
|
- [x] Custom training loop with separate backbone/decoder losses |
|
- [x] Proper handling of epoch repetition for decoder amortization |
|
- [x] Memory optimization techniques (mixed precision, gradient accumulation) |
|
- [ ] LoRA support for efficient fine-tuning |
|
- [ ] Faster inference with `torch.compile` |
|
- [ ] Voice cloning with prompt tuning / prefix optimization
|
- [ ] Support for DPO |
|
- [ ] Support for RL (GRPO, RLOO, etc.) |
|
|
|
## Acknowledgements |
|
|
|
Special thanks to: |
|
- **Sesame Labs** for the original architecture design and implementation |
|
- **Hugging Face** for the Transformers library and training infrastructure |
|
- **Claude** and **ChatGPT** for assistance with documentation and code development |
|
|
|
This project builds upon research and tools from the open-source community. I am grateful for the collaborative spirit that makes projects like this possible. |