# Accessible Speech Recognition: Fine‑tune Voxtral on Your Own Voice

Building speech technology that understands everyone is an accessibility imperative. If you have a speech impediment (e.g., stutter, dysarthria, apraxia) or a heavy accent, mainstream ASR systems can struggle. This app lets you fine‑tune the Voxtral ASR model on your own voice so it adapts to your unique speaking style — improving recognition accuracy and unlocking more inclusive voice experiences.

## Who this helps

- **People with speech differences**: Personalized models that reduce error rates on your voice
- **Accented speakers**: Adapt Voxtral to your accent and vocabulary
- **Educators/clinicians**: Create tailored recognition models for communication support
- **Product teams**: Prototype inclusive voice features with real users quickly

## What you get

- **Record or upload audio** and create a JSONL dataset in a few clicks
- **One‑click training** with full fine‑tuning or LoRA for efficiency
- **Automatic publishing** to Hugging Face Hub with a generated model card
- **Instant demo deployment** to HF Spaces for shareable, live ASR

## How it works (at a glance)

```mermaid
graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[🏗️ Architecture Overview]
    OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
    OVERVIEW --> TRAINING[🚀 Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[📊 Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture
System Components & Layers]
    ARCH --> ARCH_LINK[📄 View Details →](architecture.md)

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey
Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK[📄 View Details →](interface-workflow.md)

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts
Data → Model → Results]
    TRAINING --> TRAINING_LINK[📄 View Details →](training-pipeline.md)

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo
Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK[📄 View Details →](deployment-pipeline.md)

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey
Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK[📄 View Details →](data-flow.md)

    %% Key Components Highlight
    subgraph "🎛️ Core Components"
        INTERFACE[interface.py
Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py
Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py
Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py
Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "📁 Key Data Formats"
        JSONL[JSONL Dataset
{"audio_path": "...", "text": "..."}]
        HFDATA[HF Hub Models
username/model-name]
        SPACES[HF Spaces
Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT
    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

See the interactive diagram page for printing and quick navigation: [Interactive diagrams](diagrams.html).

## Quick start

### 1) Install

```bash
git clone https://github.com/Deep-unlearning/Finetune-Voxtral-ASR.git
cd Finetune-Voxtral-ASR
```

Use UV (recommended) or pip.

```bash
# UV
uv venv .venv --python 3.10 && source .venv/bin/activate
uv pip install -r requirements.txt

# or pip
python3.10 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
### 2) Launch the interface

```bash
python interface.py
```

The Gradio app guides you through language selection, recording or uploading audio, dataset creation, and training.

## Create your voice dataset (UI)

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files
        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts
        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL
        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment
        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints
        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo
        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for\n authentic phrases from\n NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves\n reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,\n learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming\n with Trackio integration
    note right of PostTraining : Automated deployment\n pipeline
```

Steps you’ll follow in the UI:

- **Choose language**: Select a language for authentic phrases (from NVIDIA Granary)
- **Record or upload**: Capture your voice or provide existing audio + transcripts
- **Create dataset**: The app writes a JSONL file with entries like `{ "audio_path": ..., "text": ... }` (see the example after this list)
- **Configure training**: Pick base model, LoRA vs full, batch size and learning rate
- **Run training**: Watch live logs and metrics; resume on error if needed
- **Publish & deploy**: Push to HF Hub and one‑click deploy an interactive Space
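For reference, each line of the generated JSONL file is a single JSON object pairing an audio file with its transcript. If you ever need to build one by hand, a sketch like this produces the same format (the file name, paths, and phrases are made up for the example):

```python
# Illustrative sketch: write a dataset in the same JSONL format the app generates.
import json

samples = [
    {"audio_path": "recordings/sample_000.wav", "text": "The quick brown fox jumps over the lazy dog."},
    {"audio_path": "recordings/sample_001.wav", "text": "Please call me back after three o'clock."},
]

with open("my_voice_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```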
## Train your personalized Voxtral model

Under the hood, training uses Hugging Face Trainer and a custom `VoxtralDataCollator` that builds Voxtral/LLaMA‑style prompts and masks the prompt tokens so loss is computed only on the transcription.
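The key idea behind that masking: label positions covering the prompt are set to -100, the ignore index PyTorch's cross-entropy skips, so gradients flow only from the transcription tokens. A simplified sketch of just that logic (the real `VoxtralDataCollator` also handles audio features, batching, and padding; the token values below are hypothetical):

```python
# Simplified sketch of prompt masking in a data collator (illustrative only).
import torch

IGNORE_INDEX = -100  # positions with this label are skipped by cross-entropy loss

def make_labels(prompt_ids: list[int], text_ids: list[int]) -> torch.Tensor:
    """Mask the prompt so loss is computed only on the transcription tokens."""
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(text_ids)
    return torch.tensor(labels)

# Example: a 5-token prompt followed by a 3-token transcription.
prompt_ids = [1, 2, 3, 4, 5]   # hypothetical audio/instruction tokens
text_ids = [42, 43, 44]        # hypothetical transcription tokens
print(make_labels(prompt_ids, text_ids))
# tensor([-100, -100, -100, -100, -100,   42,   43,   44])
```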
```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL[JSONL Dataset
{"audio_path": "...", "text": "..."}]
        GRANARY[NVIDIA Granary Dataset
Multilingual ASR Data]
        HFDATA[HF Hub Datasets
Community Datasets]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER[Dataset Loader
_load_jsonl_dataset()]
        CASTER[Audio Casting
16kHz resampling]
        COLLATOR[VoxtralDataCollator
Audio + Text Processing]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL[Full Fine-tuning
scripts/train.py]
        TRAIN_LORA[LoRA Fine-tuning
scripts/train_lora.py]

        subgraph "Training Components"
            MODEL_INIT[Model Initialization
VoxtralForConditionalGeneration]
            LORA_CONFIG[LoRA Configuration
LoraConfig + get_peft_model]
            PROCESSOR_INIT[Processor Initialization
VoxtralProcessor]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT[Trackio Integration
Experiment Tracking]
        HF_TRAINER[Hugging Face Trainer
TrainingArguments + Trainer]
        TORCH_DEVICE[Torch Device Setup
GPU/CPU Detection]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS[Forward Pass
Audio Processing + Generation]
        LOSS_CALC[Loss Calculation
Masked Language Modeling]
        BACKWARD_PASS[Backward Pass
Gradient Computation]
        OPTIMIZER_STEP[Optimizer Step
Parameter Updates]
        LOGGING[Metrics Logging
Loss, Perplexity, etc.]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING[Checkpoint Saving
Model snapshots]
        MODEL_SAVING[Final Model Saving
Processor + Model]
        LOCAL_STORAGE[Local Storage
outputs/ directory]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER
    LOADER --> CASTER
    CASTER --> COLLATOR
    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG
    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE
    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING
    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```
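The loader and audio-casting stages in the diagram correspond to standard 🤗 Datasets operations. A minimal sketch of loading the JSONL file and resampling to the 16 kHz Voxtral expects (the file and column names follow the dataset format above; the repo's `_load_jsonl_dataset()` may differ in detail):

```python
# Minimal sketch of the load-and-resample stage using the `datasets` library.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="my_voice_dataset.jsonl", split="train")

# Treat the path column as audio and resample every clip to 16 kHz,
# the sampling rate Voxtral's feature extractor expects.
ds = ds.cast_column("audio_path", Audio(sampling_rate=16000))

print(ds[0]["audio_path"]["sampling_rate"])  # 16000
print(ds[0]["text"])
```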
CLI alternatives (if you prefer the terminal):

```bash
# Full fine-tuning
uv run scripts/train.py

# Parameter‑efficient LoRA fine‑tuning (recommended for most users)
uv run scripts/train_lora.py
```
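If you are curious what the LoRA path sets up, the `LoraConfig + get_peft_model` step in the diagram boils down to something like the sketch below. The rank, alpha, dropout, and target modules are illustrative values, not necessarily what `scripts/train_lora.py` uses:

```python
# Illustrative LoRA setup with PEFT; hyperparameters are example values.
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

model = VoxtralForConditionalGeneration.from_pretrained("mistralai/Voxtral-Mini-3B-2507")

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

This is why LoRA is the recommended starting point: only the small adapter matrices are trained, so memory use and training time drop sharply while the base model stays frozen.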
## Publish and deploy a live demo

After training, the app can push your model and metrics to the Hugging Face Hub and create an interactive Space demo automatically.
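The publishing step wraps the standard `huggingface_hub` upload flow. A conceptual sketch of what `scripts/push_to_huggingface.py` does (the repo name and output path below are placeholders; the real script also uploads config, metrics, and a generated model card):

```python
# Conceptual sketch of model publishing with huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()  # uses the HF_TOKEN environment variable or a cached login

repo_id = "your-username/voxtral-finetune-demo"  # placeholder repo name
api.create_repo(repo_id, repo_type="model", exist_ok=True)

# Upload the trained model directory (processor + weights) saved by training.
api.upload_folder(
    folder_path="outputs/voxtral-finetuned",  # placeholder local path
    repo_id=repo_id,
    repo_type="model",
)
```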
Metrics & logs] MODEL_METADATA[Model Metadata
Name, description, etc.] end %% Model Publishing subgraph "Model Publishing" PUSH_SCRIPT[push_to_huggingface.py
Model Publisher] subgraph "Publishing Steps" REPO_CREATION[Repository Creation
HF Hub API] FILE_UPLOAD[File Upload
Model files to HF] METADATA_UPLOAD[Metadata Upload
Config & results] end end %% Model Card Generation subgraph "Model Card Generation" CARD_SCRIPT[generate_model_card.py
Card Generator] subgraph "Card Components" TEMPLATE_LOAD[Template Loading
model_card.md] VARIABLE_REPLACEMENT[Variable Replacement
Config injection] CONDITIONAL_PROCESSING[Conditional Sections
Quantized models, etc.] end end %% Demo Space Deployment subgraph "Demo Space Deployment" DEPLOY_SCRIPT[deploy_demo_space.py
Space Deployer] subgraph "Space Setup" SPACE_CREATION[Space Repository
Create HF Space] TEMPLATE_COPY[Template Copying
demo_voxtral/ files] ENV_INJECTION[Environment Setup
Model config injection] SECRET_SETUP[Secret Configuration
HF_TOKEN, model vars] end end %% Space Building & Testing subgraph "Space Building" BUILD_TRIGGER[Build Trigger
Automatic build start] DEPENDENCY_INSTALL[Dependency Installation
requirements.txt] MODEL_DOWNLOAD[Model Download
From HF Hub] APP_INITIALIZATION[App Initialization
Gradio app setup] end %% Live Demo subgraph "Live Demo Space" GRADIO_INTERFACE[Gradio Interface
Interactive demo] MODEL_INFERENCE[Model Inference
Real-time ASR] USER_INTERACTION[User Interaction
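The Space-setup steps above (repository creation, template copy, secret configuration) also map onto `huggingface_hub` calls. A rough sketch under the same placeholder names; `scripts/deploy_demo_space.py` remains the authoritative implementation:

```python
# Rough sketch of demo Space deployment with huggingface_hub; names are placeholders.
import os
from huggingface_hub import HfApi

api = HfApi()
space_id = "your-username/voxtral-asr-demo"  # placeholder Space name

# Create a Gradio Space and upload the demo app template.
api.create_repo(space_id, repo_type="space", space_sdk="gradio", exist_ok=True)
api.upload_folder(folder_path="demo_voxtral", repo_id=space_id, repo_type="space")

# Configure the secrets/variables the demo needs at runtime.
api.add_space_secret(space_id, "HF_TOKEN", os.environ["HF_TOKEN"])
api.add_space_variable(space_id, "MODEL_ID", "your-username/voxtral-finetune-demo")
```

Uploading to a Space triggers an automatic build: dependencies from `requirements.txt` are installed, the model is downloaded from the Hub, and the Gradio app starts serving.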
## Why personalization improves accessibility

- **Your model learns your patterns**: tempo, prosody, phoneme realizations, disfluencies
- **Vocabulary and names**: teach domain terms and proper nouns you use often
- **Bias correction**: reduce systematic errors common to off‑the‑shelf ASR for your voice
- **Agency and privacy**: keep data local and only publish when you choose

## Practical tips

- **Start with LoRA**: Parameter‑efficient fine‑tuning is faster and uses less memory
- **Record diverse samples**: Different tempos, environments, and phrase lengths
- **Short sessions**: Many shorter clips beat a few long ones for learning
- **Check transcripts**: Clean, accurate transcripts improve outcomes (a quick validation sketch follows this list)
## Learn more

- [Repository README](../README.md)
- [Documentation Overview](README.md)
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
- [Interactive Diagrams](diagrams.html)

---

This project exists to make voice technology work better for everyone. If you build a model that helps you — or your community — consider sharing a demo so others can learn from it.