---
library_name: transformers
license: apache-2.0
pipeline_tag: text-to-speech
---
|
|
|
# CSM-1B-HF |
|
|
|
Sesame CSM 1B model weights for my [Hugging Face implementation](https://github.com/thomasgauthier/csm-hf/).
|
|
|
--- |
|
|
|
## Overview |
|
|
|
CSM-HF is a Hugging Face implementation of [Sesame's Conversational Speech Model (CSM)](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) and a complete rewrite of the [PyTorch code provided by Sesame](https://github.com/SesameAILabs/csm). The codebase is designed to be fully compatible with Hugging Face `transformers`, from inference to training.
|
|
|
## Changes from Sesame's implementation |
|
|
|
- created a `CSMModel` class |
|
- replaced the backbone and decoder torchtune models with HF transformers `LlamaModel`
|
- added a processor class to prepare inputs for the model |
|
- added labels support and [decoder training amortization](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#:~:text=The%20audio%20decoder%20is%20trained%20on%20only%20a%20random%201/16%20subset%20of%20the%20audio%20frames%2C%20while%20the%20zeroth%20codebook%20is%20trained%20on%20every%20frame.) |
|
- added `generate_frame` and `generate` methods to the model class for generating audio |
|
- full support for the Hugging Face `Trainer`
|
|
|
## Generation |
|
|
|
You can use the model to generate audio from text input. Here's an example for voice cloning: |
|
|
|
```python
import torch
import torchaudio
from huggingface_hub import hf_hub_download
from moshi.models import loaders
from tokenizers.processors import TemplateProcessing
from transformers import AutoTokenizer

from modeling_csm import CSMModel
from processor import CSMProcessor

device = 'cuda'


def load_llama3_tokenizer():
    """
    https://github.com/huggingface/transformers/issues/22794#issuecomment-2092623992
    """
    tokenizer_name = "meta-llama/Llama-3.2-1B"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    bos = tokenizer.bos_token
    eos = tokenizer.eos_token
    tokenizer._tokenizer.post_processor = TemplateProcessing(
        single=f"{bos}:0 $A:0 {eos}:0",
        pair=f"{bos}:0 $A:0 {eos}:0 {bos}:1 $B:1 {eos}:1",
        special_tokens=[(f"{bos}", tokenizer.bos_token_id), (f"{eos}", tokenizer.eos_token_id)],
    )
    return tokenizer


text_tokenizer = load_llama3_tokenizer()

# Mimi codec from moshi, used as the audio tokenizer (32 codebooks per frame)
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
audio_tokenizer = loaders.get_mimi(mimi_weight, device=device)
audio_tokenizer.set_num_codebooks(32)

processor = CSMProcessor(text_tokenizer, audio_tokenizer)


def load_audio(path, target_sr):
    audio, sr = torchaudio.load(path)
    audio = audio.squeeze(0)
    if sr != target_sr:
        audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=target_sr)
    return audio


model = CSMModel.from_pretrained("thomasgauthier/csm-1b-hf", torch_dtype=torch.bfloat16)
model.to(device)

inputs = processor(
    messages=[
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "<AUDIO_CLIP_TRANSCRIPT>"},
                # This placeholder is required for audio tokenization: it maps to the
                # first element of the `audios` list passed to the processor.
                {"type": "audio"},
            ],
        },
        {
            "role": "speaker_0",
            "content": [
                {"type": "text", "text": "Hello, this is voice cloning speaking"},
                # No audio content here: the model will generate it.
            ],
        },
    ],
    audios=[load_audio('AUDIO_CLIP_FOR_VOICE_CLONING.wav', audio_tokenizer.sample_rate)],
    return_tensors="pt",
)

with torch.inference_mode():
    # Generate up to 50 new audio frames
    gen_frames = model.generate(
        input_ids=inputs['input_ids'].to(device),
        attention_mask=inputs['attention_mask'].to(device),
        max_new_frames=50,
        topk=50,
        temperature=1.0,
        use_cache=True,
        stop_on_all_zeros=True,
    )

# Decode the generated codebook frames back into a waveform
decoded_audio = audio_tokenizer.decode(gen_frames.permute(0, 2, 1)).squeeze(0).squeeze(0)
audio_array = (decoded_audio * 32768).to(torch.int16).cpu().numpy()

# Audio can be played with the following code:
# from IPython.display import Audio
# Audio(audio_array, rate=audio_tokenizer.sample_rate)
```
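
To write the generated audio to disk instead of playing it inline, a minimal follow-up using `torchaudio` should work (this reuses `decoded_audio` and `audio_tokenizer` from the example above; `output.wav` is an arbitrary filename):

```python
# Save the generated waveform to a WAV file (continues from the example above).
import torch
import torchaudio

torchaudio.save(
    "output.wav",
    decoded_audio.unsqueeze(0).to(torch.float32).cpu(),  # shape (channels, samples)
    sample_rate=audio_tokenizer.sample_rate,
)
```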
|
|
|
## Architecture |
|
|
|
Model architecture is discussed in [ARCHITECTURE.md](https://github.com/thomasgauthier/csm-hf/blob/main/ARCHITECTURE.md) (written by O1) |
|
|
|
## Training |
|
|
|
### Data Format |
|
|
|
CSM-HF expects training data in JSONL format, where each line is a JSON object containing a conversation. Each conversation consists of:
|
|
|
- `messages`: An array of message objects, each with: |
|
- `role`: Speaker identifier (e.g., "speaker_0", "speaker_1") |
|
- `content`: Array of content objects, which can be: |
|
- Text: `{"type": "text", "text": "The message text"}` |
|
- Audio: `{"type": "audio", "url": "path/to/audio/file.wav"}` |
|
- `training_mask`: Boolean array indicating which messages are trained on (`true`) and which serve only as context (`false`)
|
|
|
Example data format: |
|
|
|
```json
{
  "messages": [
    {
      "role": "speaker_0",
      "content": [
        {"type": "text", "text": "We have a chance for a new life here."},
        {"type": "audio", "url": "clips/example_audio.wav"}
      ]
    },
    {
      "role": "speaker_1",
      "content": [
        {"type": "text", "text": "Uncle?"},
        {"type": "audio", "url": "clips/response_audio.wav"}
      ]
    }
  ],
  "training_mask": [false, true]
}
```
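
For reference, this format can be read with nothing more than the standard library. A minimal sketch (the `training_data.jsonl` path is a placeholder; the actual dataset handling lives in the repo's training code):

```python
# Minimal sketch of iterating over the JSONL training format (not the repo's dataset code).
import json

with open("training_data.jsonl") as f:
    for line in f:
        conversation = json.loads(line)
        messages = conversation["messages"]
        training_mask = conversation["training_mask"]
        # One boolean per message: True -> trained on, False -> context only
        assert len(training_mask) == len(messages)
```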
|
|
|
### Training Process |
|
|
|
The model uses a two-stage autoregressive architecture: |
|
|
|
1. **Backbone (Inter-frame Processing)**: |
|
- Processes the entire sequence of frames |
|
- Each frame represents a combined embedding of all codebooks |
|
- Handles long-range dependencies between utterances |
|
|
|
2. **Decoder (Intra-frame Processing)**: |
|
- Processes a single frame at a time |
|
- Generates 32 codebooks sequentially (1 semantic + 31 acoustic) |
|
- Each codebook is treated as a token in the sequence (see the toy shape sketch below)
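
To make the two views concrete, here is a toy shape sketch; the tensor layout and vocabulary size are assumptions for exposition, not the exact `CSMModel` API:

```python
# Toy illustration of the backbone vs. decoder sequence views (shapes chosen for exposition only).
import torch

batch, num_frames, num_codebooks = 1, 120, 32  # 1 semantic + 31 acoustic codebooks per frame

# Audio is tokenized into one code per codebook per frame (2048 is an assumed codebook size)
audio_codes = torch.randint(0, 2048, (batch, num_frames, num_codebooks))

# Backbone view: the sequence axis is the frames; each frame becomes one combined embedding
print(audio_codes.shape)      # torch.Size([1, 120, 32])

# Decoder view: within a single frame, the 32 codebooks form their own short sequence
frame = audio_codes[:, 0, :]  # torch.Size([1, 32]), generated one codebook token at a time
print(frame.shape)
```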
|
|
|
Training leverages compute amortization techniques: |
|
- The zeroth (semantic) codebook is trained on all frames |
|
- The remaining codebooks (1-31) are trained on only `amortization_ratio` of the frames (see the sketch after this list)
|
- This significantly reduces memory usage while maintaining quality |
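
A rough sketch of what this amortization can look like (illustrative only; the convention that `amortization_ratio = 16` means roughly 1 decoder-trained frame out of every 16 is an assumption, not the repo's exact implementation):

```python
# Illustrative frame sub-sampling for decoder amortization (not the repo's actual code).
import torch

num_frames = 256
amortization_ratio = 16  # assumed convention: decoder loss on ~1 of every 16 frames

# Codebook 0 (semantic): loss is computed on every frame position
zeroth_codebook_positions = torch.arange(num_frames)

# Codebooks 1-31 (acoustic): loss is computed on a random subset of frame positions only
num_decoder_frames = max(1, num_frames // amortization_ratio)
decoder_positions = torch.randperm(num_frames)[:num_decoder_frames]
```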
|
|
|
To train the model: |
|
|
|
```bash
python train.py \
    --train_file path/to/training_data.jsonl \
    --output_dir ./output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-6
```
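
Because the codebase targets the Hugging Face `Trainer`, the flags above map onto standard `TrainingArguments`. A hedged sketch of the equivalent configuration (the dataset and collator wiring is handled by `train.py` and omitted here):

```python
# TrainingArguments implied by the CLI flags above (dataset/collator setup omitted).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
)
```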
|
|
|
|
|
## TODO |
|
|
|
- [x] Two-stage autoregressive architecture implementation |
|
- [x] Multi-codebook audio tokenization |
|
- [x] Compute amortization for efficient training |
|
- [x] Dataset preparation with interleaved text/audio |
|
- [x] Custom training loop with separate backbone/decoder losses |
|
- [x] Proper handling of epoch repetition for decoder amortization |
|
- [x] Memory optimization techniques (mixed precision, gradient accumulation) |
|
- [ ] LoRA support for efficient fine-tuning |
|
- [ ] Faster inference with `torch.compile` |
|
- [ ] Voice cloning with prompt tuning / prefix optimization
|
- [ ] Support for DPO |
|
- [ ] Support for RL (GRPO, RLOO, etc.) |
|
|
|
## Acknowledgements |
|
|
|
Special thanks to: |
|
- **Sesame Labs** for the original architecture design and implementation |
|
- **Hugging Face** for the Transformers library and training infrastructure |
|
- **Claude** and **ChatGPT** for assistance with documentation and code development |
|
|
|
This project builds upon research and tools from the open-source community. I am grateful for the collaborative spirit that makes projects like this possible. |