---
title: VoxFactory
emoji: 🌬️
colorFrom: gray
colorTo: red
sdk: gradio
app_file: interface.py
pinned: false
license: mit
short_description: FinetuneASR Voxtral
---
# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the Voxtral speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.
## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```
### Step 2: Set up environment

Choose your preferred package manager:

**📦 Using UV (recommended)**

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```

**🐍 Using pip**

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
## Dataset Preparation
The training script loads `hf-audio/esb-datasets-test-only-sorted` with the `voxpopuli` config, casts the audio to 16 kHz, and keeps a small train/eval slice for quick experiments.
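Roughly, the data loading step looks like the sketch below (the dataset name and config match the script; the split name and slice sizes are illustrative, so check `train.py` for the exact values):

```python
from datasets import load_dataset, Audio

# Load the VoxPopuli config of the ESB test-only dataset
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")

# Cast audio to 16 kHz, the sampling rate the Voxtral processor expects
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Keep small train/eval slices for a quick fine-tuning run (sizes are illustrative)
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))
```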
For ASR fine-tuning, each example looks like:

- Inputs: `[AUDIO] … [AUDIO] <transcribe> <reference transcription>`
- Labels: the same sequence, but the prefix `[AUDIO] … [AUDIO] <transcribe>` is masked with `-100`, so the loss is computed only on the transcription tokens.
The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking).
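For illustration only, the masking step inside such a collator boils down to something like the following (a simplified sketch, not the actual `VoxtralDataCollator` code; `prompt_len` stands for the number of audio tokens plus the `<transcribe>` token):

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Copy input_ids into labels and hide the prompt and padding from the loss."""
    labels = input_ids.clone()
    labels[:prompt_len] = -100              # ignore the [AUDIO] ... [AUDIO] <transcribe> prefix
    labels[labels == pad_token_id] = -100   # ignore padding tokens as well
    return labels
```

With labels built this way, the cross-entropy loss only sees the transcription tokens.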
The dataset only needs two fields:

```python
{
    "audio": {"array": <float32 numpy array>, "sampling_rate": 16000, ...},
    "text": "<reference transcription>"
}
```
If you want to swap in a different dataset, make sure that after loading you still have:

- an `audio` column (cast to `Audio(sampling_rate=16000)`), and
- a `text` column (the reference transcription).

If your dataset uses different column names, map them to `audio` and `text` before returning.
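For example, adapting a hypothetical dataset whose transcriptions live in a `sentence` column might look like this (the repo id and column name are placeholders):

```python
from datasets import load_dataset, Audio

# Placeholder dataset with an "audio" column and a "sentence" transcription column
ds = load_dataset("your-username/your-asr-dataset", split="train")

# Rename the transcription column and resample audio to 16 kHz
ds = ds.rename_column("sentence", "text")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```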
## Training

Run the training script:

```bash
uv run train.py
```

Logs and checkpoints will be saved under the `outputs/` directory by default.
## Training with LoRA

You can also run the training script with LoRA:

```bash
uv run train_lora.py
```
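LoRA fine-tuning of this kind is typically configured with the `peft` library; a hedged sketch of what such a configuration can look like is below (the rank, alpha, dropout, and target module names are illustrative, not the script's exact values, so check `train_lora.py` before changing anything):

```python
from peft import LoraConfig, get_peft_model

# Illustrative values only; see train_lora.py for the actual configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder attention projections
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)  # wrap the base Voxtral model with adapters
```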
Happy fine-tuning Voxtral! 🚀