# Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice
Building speech technology that understands everyone is an accessibility imperative. If you have a speech difference (e.g., a stutter, dysarthria, or apraxia) or a strong accent, mainstream ASR systems often struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style — improving recognition accuracy and unlocking more inclusive voice experiences.
## Who this helps
- **People with speech differences**: Personalized models that reduce error rates on your voice
- **Accented speakers**: Adapt Voxtral to your accent and vocabulary
- **Educators/clinicians**: Create tailored recognition models for communication support
- **Product teams**: Prototype inclusive voice features with real users quickly
## What you get
- **Record or upload audio** and create a JSONL dataset in a few clicks
- **One‑click training** with full fine‑tuning or LoRA for efficiency
- **Automatic publishing** to Hugging Face Hub with a generated model card
- **Instant demo deployment** to HF Spaces for shareable, live ASR
## How it works (at a glance)
```mermaid
graph TD
%% Main Entry Point
START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}
%% Documentation Categories
OVERVIEW --> ARCH[🏗️ Architecture Overview]
OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
OVERVIEW --> TRAINING[🚀 Training Pipeline]
OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
OVERVIEW --> DATAFLOW[📊 Data Flow]
%% Architecture Section
ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
ARCH --> ARCH_LINK["📄 View Details → architecture.md"]
click ARCH_LINK "architecture.md"
%% Interface Section
WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
WORKFLOW --> WORKFLOW_LINK["📄 View Details → interface-workflow.md"]
click WORKFLOW_LINK "interface-workflow.md"
%% Training Section
TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
TRAINING --> TRAINING_LINK["📄 View Details → training-pipeline.md"]
click TRAINING_LINK "training-pipeline.md"
%% Deployment Section
DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
DEPLOYMENT --> DEPLOYMENT_LINK["📄 View Details → deployment-pipeline.md"]
click DEPLOYMENT_LINK "deployment-pipeline.md"
%% Data Flow Section
DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
DATAFLOW --> DATAFLOW_LINK["📄 View Details → data-flow.md"]
click DATAFLOW_LINK "data-flow.md"
%% Key Components Highlight
subgraph "🎛️ Core Components"
INTERFACE[interface.py
Gradio Web UI]
TRAIN_SCRIPTS[scripts/train*.py
Training Scripts]
DEPLOY_SCRIPT[scripts/deploy_demo_space.py
Demo Deployment]
PUSH_SCRIPT[scripts/push_to_huggingface.py
Model Publishing]
end
%% Data Flow Highlight
subgraph "📁 Key Data Formats"
JSONL[JSONL Dataset
{"audio_path": "...", "text": "..."}]
HFDATA[HF Hub Models
username/model-name]
SPACES[HF Spaces
Interactive Demos]
end
%% Connect components to their respective docs
INTERFACE --> WORKFLOW
TRAIN_SCRIPTS --> TRAINING
DEPLOY_SCRIPT --> DEPLOYMENT
PUSH_SCRIPT --> DEPLOYMENT
JSONL --> DATAFLOW
HFDATA --> DEPLOYMENT
SPACES --> DEPLOYMENT
%% Styling
classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
class START entry
class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
class JSONL,HFDATA,SPACES data
```
See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).
## Quick start
### 1) Install
```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```
Use UV (recommended) or pip.
```bash
# UV
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt
# or pip
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
### 2) Launch the interface
```bash
python interface.py
```
The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.
## Create your voice dataset (UI)
```mermaid
stateDiagram-v2
[*] --> LanguageSelection: User opens interface
state "Language & Dataset Setup" as LangSetup {
[*] --> LanguageSelection
LanguageSelection --> LoadPhrases: Select language
LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
DisplayPhrases --> RecordingInterface: Show phrases & recording UI
state RecordingInterface {
[*] --> ShowInitialRows: Display first 10 phrases
ShowInitialRows --> RecordAudio: User can record audio
RecordAudio --> AddMoreRows: Optional - add 10 more rows
AddMoreRows --> RecordAudio
}
}
RecordingInterface --> DatasetCreation: User finishes recording
state "Dataset Creation Options" as DatasetCreation {
[*] --> FromRecordings: Create from recorded audio
[*] --> FromUploads: Upload existing files
FromRecordings --> ProcessRecordings: Save WAV files + transcripts
FromUploads --> ProcessUploads: Process uploaded files + transcripts
ProcessRecordings --> CreateJSONL: Generate JSONL dataset
ProcessUploads --> CreateJSONL
CreateJSONL --> DatasetReady: Dataset saved locally
}
DatasetCreation --> TrainingConfiguration: Dataset ready
state "Training Setup" as TrainingConfiguration {
[*] --> BasicSettings: Model, LoRA/full, batch size
[*] --> AdvancedSettings: Learning rate, epochs, LoRA params
BasicSettings --> ConfigureDeployment: Repo name, push options
AdvancedSettings --> ConfigureDeployment
ConfigureDeployment --> StartTraining: All settings configured
}
TrainingConfiguration --> TrainingProcess: Start training
state "Training Process" as TrainingProcess {
[*] --> InitializeTrackio: Setup experiment tracking
InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
RunTrainingScript --> StreamLogs: Show real-time training logs
StreamLogs --> MonitorProgress: Track metrics & checkpoints
MonitorProgress --> TrainingComplete: Training finished
MonitorProgress --> HandleErrors: Training failed
HandleErrors --> RetryOrExit: User can retry or exit
}
TrainingProcess --> PostTraining: Training complete
state "Post-Training Actions" as PostTraining {
[*] --> PushToHub: Push model to HF Hub
[*] --> GenerateModelCard: Create model card
[*] --> DeployDemoSpace: Deploy interactive demo
PushToHub --> ModelPublished: Model available on HF Hub
GenerateModelCard --> ModelDocumented: Model card created
DeployDemoSpace --> DemoReady: Demo space deployed
}
PostTraining --> [*]: Process complete
%% Alternative paths
DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
%% Error handling
TrainingProcess --> ErrorRecovery: Handle training errors
ErrorRecovery --> RetryTraining: Retry with different settings
RetryTraining --> TrainingConfiguration
%% Styling and notes
note right of LanguageSelection : User selects a language for authentic phrases from the NVIDIA Granary dataset
note right of RecordingInterface : Users record themselves reading the displayed phrases
note right of DatasetCreation : JSONL format, one audio_path + text record per line
note right of TrainingConfiguration : Configure LoRA parameters, learning rate, epochs, etc.
note right of TrainingProcess : Real-time log streaming with Trackio integration
note right of PostTraining : Automated deployment pipeline
```
Steps you’ll follow in the UI:
- **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
- **Record or upload**: Capture your voice or provide existing audio + transcripts
- **Create dataset**: The app writes a JSONL file with entries like `{ "audio_path": ..., "text": ... }` (see the sketch after this list)
- **Configure training**: Pick base model, LoRA vs full, batch size and learning rate
- **Run training**: Watch live logs and metrics; resume on error if needed
- **Publish & deploy**: Push to HF Hub and one‑click deploy an interactive Space
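For reference, here is a minimal sketch of what the dataset step produces, using the documented `{"audio_path": ..., "text": ...}` schema; the file name and clip paths are illustrative, not the app's actual output layout:

```python
# Minimal sketch: write a JSONL dataset in the documented schema.
# File names and paths here are illustrative placeholders.
import json

samples = [
    {"audio_path": "recordings/clip_001.wav", "text": "hello world"},
    {"audio_path": "recordings/clip_002.wav", "text": "testing my voice"},
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        # one independent JSON object per line
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

Because each line is a self-contained record, you can append new recordings to the file over time without rewriting it.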
## Train your personalized Voxtral model
Under the hood, training uses Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so loss is computed only on the transcription.
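As a rough sketch of that masking idea (not the project's actual `VoxtralDataCollator`), prompt positions receive the label `-100`, which PyTorch's cross-entropy loss ignores by default, so gradients come only from the transcript tokens:

```python
# Rough sketch of prompt masking, not the actual VoxtralDataCollator:
# label -100 is CrossEntropyLoss's default ignore_index, so prompt/audio
# positions contribute nothing to the loss.
import torch

def mask_prompt_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    labels = input_ids.clone()
    labels[:prompt_len] = -100  # no loss on the prompt tokens
    return labels

ids = torch.tensor([101, 102, 103, 7, 8, 9])  # prompt tokens, then transcript tokens
print(mask_prompt_labels(ids, prompt_len=3))  # tensor([-100, -100, -100, 7, 8, 9])
```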
```mermaid
graph TB
%% Input Data Sources
subgraph "Data Sources"
JSONL["JSONL Dataset<br/>{audio_path, text}"]
GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
HFDATA["HF Hub Datasets<br/>Community Datasets"]
end
%% Data Processing
subgraph "Data Processing"
LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
CASTER["Audio Casting<br/>16 kHz resampling"]
COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
end
%% Training Scripts
subgraph "Training Scripts"
TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]
subgraph "Training Components"
MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
end
end
%% Training Infrastructure
subgraph "Training Infrastructure"
TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
end
%% Training Process
subgraph "Training Process"
FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
end
%% Model Management
subgraph "Model Management"
CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
end
%% Flow Connections
JSONL --> LOADER
GRANARY --> LOADER
HFDATA --> LOADER
LOADER --> CASTER
CASTER --> COLLATOR
COLLATOR --> TRAIN_FULL
COLLATOR --> TRAIN_LORA
TRAIN_FULL --> MODEL_INIT
TRAIN_LORA --> MODEL_INIT
TRAIN_LORA --> LORA_CONFIG
MODEL_INIT --> PROCESSOR_INIT
LORA_CONFIG --> PROCESSOR_INIT
PROCESSOR_INIT --> TRACKIO_INIT
PROCESSOR_INIT --> HF_TRAINER
PROCESSOR_INIT --> TORCH_DEVICE
TRACKIO_INIT --> HF_TRAINER
TORCH_DEVICE --> HF_TRAINER
HF_TRAINER --> FORWARD_PASS
FORWARD_PASS --> LOSS_CALC
LOSS_CALC --> BACKWARD_PASS
BACKWARD_PASS --> OPTIMIZER_STEP
OPTIMIZER_STEP --> LOGGING
LOGGING --> CHECKPOINT_SAVING
LOGGING --> TRACKIO_INIT
HF_TRAINER --> MODEL_SAVING
MODEL_SAVING --> LOCAL_STORAGE
%% Styling
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
class JSONL,GRANARY,HFDATA input
class LOADER,CASTER,COLLATOR processing
class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```
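The loader and audio-casting stages above can be approximated with the `datasets` library; a minimal sketch, assuming a local `data.jsonl` in the documented format (the column and file names are assumptions, not the scripts' actual arguments):

```python
# Minimal sketch of the load-and-resample stage, assuming the `datasets`
# library and a local data.jsonl; Voxtral's audio front end expects 16 kHz.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="data.jsonl", split="train")
# Casting a column of file paths to Audio decodes and resamples on access.
ds = ds.cast_column("audio_path", Audio(sampling_rate=16000))
print(ds[0]["audio_path"]["sampling_rate"])  # 16000
```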
CLI alternatives (if you prefer the terminal):
```bash
# Full fine-tuning
uv run scripts/train.py
# Parameter‑efficient LoRA fine‑tuning (recommended for most users)
uv run scripts/train_lora.py
```
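For readers curious what the LoRA path looks like in code, here is a minimal sketch with `peft`, matching the `LoraConfig` + `get_peft_model` components named in the training diagram; the base checkpoint, rank, and `target_modules` are illustrative assumptions rather than the scripts' defaults:

```python
# Rough sketch of LoRA attachment with peft; hyperparameters and
# target_modules are illustrative, not this repo's actual defaults.
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507"  # assumed base checkpoint
)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights train
```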
## Publish and deploy a live demo
After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.
```mermaid
graph TB
%% Input Sources
subgraph "Inputs"
TRAINED_MODEL["Trained Model<br/>Local directory"]
TRAINING_CONFIG["Training Config<br/>JSON/YAML"]
TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
end
%% Model Publishing
subgraph "Model Publishing"
PUSH_SCRIPT["push_to_huggingface.py<br/>Model Publisher"]
subgraph "Publishing Steps"
REPO_CREATION["Repository Creation<br/>HF Hub API"]
FILE_UPLOAD["File Upload<br/>Model files to HF"]
METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
end
end
%% Model Card Generation
subgraph "Model Card Generation"
CARD_SCRIPT["generate_model_card.py<br/>Card Generator"]
subgraph "Card Components"
TEMPLATE_LOAD["Template Loading<br/>model_card.md"]
VARIABLE_REPLACEMENT["Variable Replacement<br/>Config injection"]
CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
end
end
%% Demo Space Deployment
subgraph "Demo Space Deployment"
DEPLOY_SCRIPT["deploy_demo_space.py<br/>Space Deployer"]
subgraph "Space Setup"
SPACE_CREATION["Space Repository<br/>Create HF Space"]
TEMPLATE_COPY["Template Copying<br/>demo_voxtral/ files"]
ENV_INJECTION["Environment Setup<br/>Model config injection"]
SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
end
end
%% Space Building & Testing
subgraph "Space Building"
BUILD_TRIGGER["Build Trigger<br/>Automatic build start"]
DEPENDENCY_INSTALL["Dependency Installation<br/>requirements.txt"]
MODEL_DOWNLOAD["Model Download<br/>From HF Hub"]
APP_INITIALIZATION["App Initialization<br/>Gradio app setup"]
end
%% Live Demo
subgraph "Live Demo Space"
GRADIO_INTERFACE["Gradio Interface<br/>Interactive demo"]
MODEL_INFERENCE["Model Inference<br/>Real-time ASR"]
USER_INTERACTION["User Interaction<br/>Audio upload/playback"]
end
%% External Services
subgraph "External Services"
HF_HUB["Hugging Face Hub<br/>Model & Space hosting"]
HF_SPACES["HF Spaces Platform<br/>Demo hosting"]
end
%% Flow Connections
TRAINED_MODEL --> PUSH_SCRIPT
TRAINING_CONFIG --> PUSH_SCRIPT
TRAINING_RESULTS --> PUSH_SCRIPT
MODEL_METADATA --> PUSH_SCRIPT
PUSH_SCRIPT --> REPO_CREATION
REPO_CREATION --> FILE_UPLOAD
FILE_UPLOAD --> METADATA_UPLOAD
METADATA_UPLOAD --> CARD_SCRIPT
TRAINING_CONFIG --> CARD_SCRIPT
TRAINING_RESULTS --> CARD_SCRIPT
CARD_SCRIPT --> TEMPLATE_LOAD
TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
METADATA_UPLOAD --> DEPLOY_SCRIPT
DEPLOY_SCRIPT --> SPACE_CREATION
SPACE_CREATION --> TEMPLATE_COPY
TEMPLATE_COPY --> ENV_INJECTION
ENV_INJECTION --> SECRET_SETUP
SECRET_SETUP --> BUILD_TRIGGER
BUILD_TRIGGER --> DEPENDENCY_INSTALL
DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
MODEL_DOWNLOAD --> APP_INITIALIZATION
APP_INITIALIZATION --> GRADIO_INTERFACE
GRADIO_INTERFACE --> MODEL_INFERENCE
MODEL_INFERENCE --> USER_INTERACTION
HF_HUB --> MODEL_DOWNLOAD
HF_SPACES --> GRADIO_INTERFACE
%% Styling
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
class HF_HUB,HF_SPACES external
```
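A minimal sketch of the model-publishing half with `huggingface_hub` (the repo ID and output folder are placeholders, and the project's `push_to_huggingface.py` handles more, such as metadata and the generated model card):

```python
# Minimal sketch of pushing a trained model with huggingface_hub;
# repo_id and folder_path are placeholders. Authentication comes from
# a stored token or the HF_TOKEN environment variable.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/voxtral-asr-personal"  # placeholder repo name
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(folder_path="outputs/", repo_id=repo_id, repo_type="model")
print(f"https://huggingface.co/{repo_id}")
```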
## Why personalization improves accessibility
- **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
- **Vocabulary and names**: teach domain terms and proper nouns you use often
- **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
- **Agency and privacy**: keep data local and only publish when you choose
## Practical tips
- **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory
- **Record diverse samples**: Different tempos, environments, and phrase lengths
- **Short sessions**: Many shorter clips beat a few long ones for learning
- **Check transcripts**: Clean, accurate transcripts improve outcomes
## Learn more
- [Repository README](../README.md)
- [Documentation Overview](README.md)
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
- [Interactive Diagrams](diagrams.html)
---
This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.