---
title: VoxFactory
emoji: 🌬️
colorFrom: gray
colorTo: red
sdk: gradio
app_file: interface.py
pinned: false
license: mit
short_description: FinetuneASR Voxtral
---

# Finetune Voxtral for ASR with Transformers 🤗

This repository fine-tunes the [Voxtral](https://huggingface.co/Deep-unlearning/Voxtral) speech model on conversational speech datasets using the Hugging Face `transformers` and `datasets` libraries.

## Installation

### Step 1: Clone the repository

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

### Step 2: Set up environment

Choose your preferred package manager:
#### 📦 Using UV (recommended)

[Install `uv`](https://docs.astral.sh/uv/getting-started/installation/)

```bash
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
```
#### 🐍 Using pip

```bash
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
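To confirm the environment is ready, you can run a quick import check. This is a minimal sketch (the `check_env.py` filename is just a suggestion); it assumes your installed `transformers` version ships Voxtral support and that `torch` is pulled in by `requirements.txt`:

```python
# check_env.py -- quick sanity check for the fine-tuning environment (illustrative helper, not part of the repo)
import torch
import transformers

print(f"transformers {transformers.__version__}, torch {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Voxtral classes only exist in recent transformers releases;
# an ImportError here means the installed version is too old.
from transformers import VoxtralForConditionalGeneration, AutoProcessor  # noqa: F401
print("Voxtral support: OK")
```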
## Dataset Preparation

The training script loads the **`voxpopuli`** config of `hf-audio/esb-datasets-test-only-sorted`, casts the audio to 16 kHz, and keeps a small train/eval slice.

For ASR fine-tuning, each example is turned into a Voxtral/LLaMA-style prompt:

* **Inputs**: `[AUDIO] … [AUDIO] <transcription>`
* **Labels**: the same sequence, but the `[AUDIO] … [AUDIO]` prefix is **masked with `-100`**, so the loss is computed **only** on the transcription tokens.

The `VoxtralDataCollator` already builds this sequence (prompt expansion via the processor and label masking); a simplified illustration of the masking step is included in the example snippets at the end of this README. The dataset only needs two fields:

```python
{
    "audio": {"array": <float array>, "sampling_rate": 16000, ...},
    "text": "<transcription>"
}
```

If you want to swap in a different dataset, make sure that after loading you still have:

* an **`audio`** column (cast to `Audio(sampling_rate=16000)`), and
* a **`text`** column (the reference transcription).

If your dataset uses different column names, map them to `audio` and `text` before returning the dataset (see the dataset sketch in the example snippets below).

## Training

Run the training script:

```bash
uv run train.py
```

Logs and checkpoints are saved under the `outputs/` directory by default.

## Training with LoRA

You can also run the training script with LoRA:

```bash
uv run train_lora.py
```

**Happy fine-tuning Voxtral!** 🚀
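
## Example snippets

The sketches below expand on the notes above. They are illustrations only, not the repo's actual code.

**Dataset preparation.** A minimal sketch of loading and preparing the dataset used by the training script; the split name, slice sizes, and the hypothetical `sentence` column are assumptions for illustration:

```python
from datasets import load_dataset, Audio

# Load the voxpopuli config of the small test-only ESB mirror.
ds = load_dataset("hf-audio/esb-datasets-test-only-sorted", "voxpopuli", split="test")

# Voxtral expects 16 kHz audio, so cast the audio column accordingly.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

# Keep a small train/eval slice for quick experiments.
train_ds = ds.select(range(100))
eval_ds = ds.select(range(100, 120))

# If your own dataset names the transcription column differently (e.g. "sentence"),
# rename it so the collator finds the expected "text" field.
# train_ds = train_ds.rename_column("sentence", "text")
```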
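
**Label masking.** A simplified illustration of the masking idea from Dataset Preparation. This is not the repo's `VoxtralDataCollator`; it only shows how the prompt prefix (the `[AUDIO]` tokens plus any prompt text) is hidden from the loss with `-100`:

```python
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Build labels from input_ids, masking the prompt prefix and padding with -100."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100          # ignore the [AUDIO] ... [AUDIO] prompt tokens
    labels[labels == pad_token_id] = -100  # ignore padding as well
    return labels

# Toy example: a batch of 1 with a 4-token prompt followed by 3 transcription tokens.
ids = torch.tensor([[101, 102, 103, 104, 7, 8, 9]])
print(mask_prompt_labels(ids, prompt_len=4, pad_token_id=0))
# tensor([[-100, -100, -100, -100,    7,    8,    9]])
```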
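
**LoRA setup.** `train_lora.py` defines its own configuration; the snippet below is only a sketch of a typical PEFT setup so you can see which knobs are involved. The rank, alpha, dropout, and target module names are assumptions rather than the repo's values, and the model ID shown is the public `mistralai/Voxtral-Mini-3B-2507` checkpoint (swap in whichever base model the scripts use):

```python
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

# Typical LoRA hyperparameters for the attention projections; adjust to taste.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable
```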