🎯 Voxtral ASR Fine-tuning

Architecture & Workflow Diagrams

Interactive documentation with Mermaid diagrams

📋 Documentation Overview
High-level overview of the Voxtral ASR Fine-tuning application and its documentation structure.
graph TD
    START(["Voxtral ASR Fine-tuning App"]) --> OVERVIEW{Choose Documentation}
    OVERVIEW --> ARCH["Architecture Overview"]
    OVERVIEW --> WORKFLOW["Interface Workflow"]
    OVERVIEW --> TRAINING["Training Pipeline"]
    OVERVIEW --> DEPLOYMENT["Deployment Pipeline"]
    OVERVIEW --> DATAFLOW["Data Flow"]
    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
    subgraph "Core Components"
        INTERFACE["interface.py<br/>Gradio Web UI"]
        TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
        DEPLOY_SCRIPT["scripts/deploy_demo_space.py<br/>Demo Deployment"]
        PUSH_SCRIPT["scripts/push_to_huggingface.py<br/>Model Publishing"]
    end
    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        HFDATA["HF Hub Models<br/>username/model-name"]
        SPACES["HF Spaces<br/>Interactive Demos"]
    end
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT
    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
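The JSONL dataset named under "Key Data Formats" can be illustrated with a minimal sketch. It assumes one JSON object per line with exactly the two keys shown in the diagram, `audio_path` and `text`. The helper names `write_jsonl` and `read_jsonl` are hypothetical and are not taken from the project's scripts. Note that real JSON uses double quotes, while the diagram label shows the Python-style single-quoted form.

```python
import json
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line (the JSONL convention)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-English transcripts readable on disk.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read rows back, skipping blank lines and checking the two expected keys."""
    rows = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue
            row = json.loads(line)
            missing = {"audio_path", "text"} - row.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            rows.append(row)
    return rows
```

Validating the keys at load time is a design choice of this sketch: a malformed row then fails early with its line number instead of surfacing later during training.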
System Architecture
High-level architecture showing the main components and their relationships in the Voxtral ASR Fine-tuning application.
graph TB
    subgraph "User Interface"
        UI["Gradio Web Interface<br/>interface.py"]
        REC["Audio Recording<br/>Microphone Input"]
        UP["File Upload<br/>WAV/FLAC files"]
    end
    subgraph "Data Processing"
        DP["Data Processing<br/>Audio resampling<br/>JSONL creation"]
        DS["Dataset Management<br/>NVIDIA Granary<br/>Local datasets"]
    end
    subgraph "Training Pipeline"
        TF["Full Fine-tuning<br/>scripts/train.py"]
        TL["LoRA Fine-tuning<br/>scripts/train_lora.py"]
        TI["Trackio Integration<br/>Experiment Tracking"]
    end
    subgraph "Model Management"
        MM["Model Management<br/>Hugging Face Hub<br/>Local storage"]
        MC["Model Card Generation<br/>scripts/generate_model_card.py"]
    end
    subgraph "Deployment & Demo"
        DEP["Demo Space Deployment<br/>scripts/deploy_demo_space.py"]
        HF["HF Spaces<br/>Interactive Demo"]
    end
    subgraph "External Services"
        HFH["Hugging Face Hub<br/>Models & Datasets"]
        GRAN["NVIDIA Granary<br/>Multilingual ASR Dataset"]
        TRACK["Trackio Spaces<br/>Experiment Tracking"]
    end
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS
    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI
    TF --> MM
    TL --> MM
    MM --> MC
    MM --> DEP
    DEP --> HF
    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
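The architecture offers two training paths, full fine-tuning and LoRA fine-tuning. The arithmetic below is a rough sketch of why the LoRA path is lighter: for a single weight matrix, full fine-tuning updates every entry, while LoRA freezes the matrix and trains two low-rank factors. The layer size and rank are hypothetical values chosen only to make the numbers concrete. They are not taken from Voxtral or from the project's training scripts.

```python
def full_trainable_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning updates every entry of a d_out x d_in weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, used only for illustration.
d_in, d_out, rank = 4096, 4096, 8
full = full_trainable_params(d_in, d_out)
lora = lora_trainable_params(d_in, d_out, rank)
print(f"full: {full:,} trainable parameters")
print(f"lora: {lora:,} trainable parameters ({full // lora}x fewer)")
```

The saving grows with layer size and shrinks with rank, so the choice between the two scripts is largely a trade-off between memory budget and how much of the model should be allowed to change.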
Interface Workflow
Complete user journey through the Voxtral ASR Fine-tuning interface, from language selection to demo deployment.
flowchart TD
    START(["User Opens Interface"]) --> LANG["Language Selection<br/>Choose from 25+ languages"]
    LANG --> PHRASES["Load Phrases<br/>From NVIDIA Granary"]
    PHRASES --> RECORD["Recording Interface<br/>Display phrases + audio recording"]
    RECORD --> |User Records| PROCESS_REC["Process Recordings<br/>Save WAV files + transcripts"]
    RECORD --> |Upload Files| PROCESS_UPLOAD["Process Uploads<br/>Handle existing files + transcripts"]
    PROCESS_REC --> JSONL["Create JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
    PROCESS_UPLOAD --> JSONL
    JSONL --> CONFIG["Training Configuration<br/>Model, LoRA/full, hyperparameters"]
    CONFIG --> TRAIN["Training Process<br/>Execute train.py or train_lora.py"]
    TRAIN --> PUSH["Push to Hub<br/>Model + metadata to HF Hub"]
    TRAIN --> CARD["Generate Model Card<br/>Automated documentation"]
    PUSH --> DEPLOY["Deploy Demo Space<br/>Interactive demo on HF Spaces"]
    DEPLOY --> END(["Demo Ready<br/>Interactive ASR Demo"])
    PUSH -.-> END
    CARD -.-> END
    classDef start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef process fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef terminal fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
    class START start
    class END terminal
    class LANG,PHRASES,RECORD,PROCESS_REC,PROCESS_UPLOAD,JSONL,CONFIG,TRAIN,PUSH,CARD,DEPLOY process
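The "Process Recordings" step (save WAV files plus transcripts) might look roughly like the sketch below. It uses only the Python standard library and assumes mono 16-bit PCM at 16 kHz. The function names `save_recording` and `make_row` and the file name `rec_0001.wav` are hypothetical. The actual interface may store audio differently.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def save_recording(samples, sample_rate, wav_path):
    """Write mono 16-bit PCM samples (floats in [-1, 1]) to a WAV file."""
    wav_path = Path(wav_path)
    wav_path.parent.mkdir(parents=True, exist_ok=True)
    # Clip to [-1, 1], then scale to the signed 16-bit range.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(str(wav_path), "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # bytes per sample
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return wav_path


def make_row(wav_path, text):
    """Build the dataset row that pairs a saved recording with its transcript."""
    return {"audio_path": str(wav_path), "text": text}


# Usage: a 0.1 s, 440 Hz test tone stands in for a microphone recording.
sample_rate = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1600)]
out_dir = Path(tempfile.mkdtemp())
wav_file = save_recording(tone, sample_rate, out_dir / "wavs" / "rec_0001.wav")
row = make_row(wav_file, "hello world")
print(row)
```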
Training Pipeline
Detailed training pipeline showing how data flows through training scripts and supporting infrastructure.
graph TB
    subgraph "Data Sources"
        JSONL["JSONL Dataset<br/>{'audio_path': '...', 'text': '...'}"]
        GRANARY["NVIDIA Granary Dataset<br/>Multilingual ASR Data"]
        HFDATA["HF Hub Datasets<br/>Community Datasets"]
    end
    subgraph "Data Processing"
        LOADER["Dataset Loader<br/>_load_jsonl_dataset()"]
        CASTER["Audio Casting<br/>16kHz resampling"]
        COLLATOR["VoxtralDataCollator<br/>Audio + Text Processing"]
    end
    subgraph "Training Scripts"
        TRAIN_FULL["Full Fine-tuning<br/>scripts/train.py"]
        TRAIN_LORA["LoRA Fine-tuning<br/>scripts/train_lora.py"]
        subgraph "Training Components"
            MODEL_INIT["Model Initialization<br/>VoxtralForConditionalGeneration"]
            LORA_CONFIG["LoRA Configuration<br/>LoraConfig + get_peft_model"]
            PROCESSOR_INIT["Processor Initialization<br/>VoxtralProcessor"]
        end
    end
    subgraph "Training Infrastructure"
        TRACKIO_INIT["Trackio Integration<br/>Experiment Tracking"]
        HF_TRAINER["Hugging Face Trainer<br/>TrainingArguments + Trainer"]
        TORCH_DEVICE["Torch Device Setup<br/>GPU/CPU Detection"]
    end
    subgraph "Training Process"
        FORWARD_PASS["Forward Pass<br/>Audio Processing + Generation"]
        LOSS_CALC["Loss Calculation<br/>Masked Language Modeling"]
        BACKWARD_PASS["Backward Pass<br/>Gradient Computation"]
        OPTIMIZER_STEP["Optimizer Step<br/>Parameter Updates"]
        LOGGING["Metrics Logging<br/>Loss, Perplexity, etc."]
    end
    subgraph "Model Management"
        CHECKPOINT_SAVING["Checkpoint Saving<br/>Model snapshots"]
        MODEL_SAVING["Final Model Saving<br/>Processor + Model"]
        LOCAL_STORAGE["Local Storage<br/>outputs/ directory"]
    end
    LOADER --> CASTER
    CASTER --> COLLATOR
    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA
    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG
    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT
    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE
    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER
    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING
    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT
    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
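The "Loss Calculation" and "Metrics Logging" steps can be made concrete with a small, framework-free sketch. It assumes that the masked loss is a cross-entropy averaged only over supervised positions, with skipped positions marked by the label `-100` (the default `ignore_index` of PyTorch's cross-entropy loss). Whether the project's collator masks labels exactly this way is an assumption. The real scripts rely on the Hugging Face Trainer and do not compute the loss by hand.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids; ignore_index marks skipped positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue
        # Numerically stable log-softmax: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]
        count += 1
    if count == 0:
        raise ValueError("no supervised positions in this batch")
    return total / count


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_loss)
```

With uniform scores over a four-token vocabulary, the loss is ln 4 and the perplexity is 4, which is a quick sanity check when reading logged metrics.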
Deployment Pipeline
Model publishing and demo deployment process from trained model to live interactive demo.
graph TB
    subgraph "Inputs"
        TRAINED_MODEL["Trained Model<br/>Local directory"]
        TRAINING_CONFIG["Training Config<br/>JSON/YAML"]
        TRAINING_RESULTS["Training Results<br/>Metrics & logs"]
        MODEL_METADATA["Model Metadata<br/>Name, description, etc."]
    end
    subgraph "Model Publishing"
        PUSH_SCRIPT["push_to_huggingface.py<br/>Model Publisher"]
        subgraph "Publishing Steps"
            REPO_CREATION["Repository Creation<br/>HF Hub API"]
            FILE_UPLOAD["File Upload<br/>Model files to HF"]
            METADATA_UPLOAD["Metadata Upload<br/>Config & results"]
        end
    end
    subgraph "Model Card Generation"
        CARD_SCRIPT["generate_model_card.py<br/>Card Generator"]
        subgraph "Card Components"
            TEMPLATE_LOAD["Template Loading<br/>model_card.md"]
            VARIABLE_REPLACEMENT["Variable Replacement<br/>Config injection"]
            CONDITIONAL_PROCESSING["Conditional Sections<br/>Quantized models, etc."]
        end
    end
    subgraph "Demo Space Deployment"
        DEPLOY_SCRIPT["deploy_demo_space.py<br/>Space Deployer"]
        subgraph "Space Setup"
            SPACE_CREATION["Space Repository<br/>Create HF Space"]
            TEMPLATE_COPY["Template Copying<br/>demo_voxtral/ files"]
            ENV_INJECTION["Environment Setup<br/>Model config injection"]
            SECRET_SETUP["Secret Configuration<br/>HF_TOKEN, model vars"]
        end
    end
    subgraph "Space Building"
        BUILD_TRIGGER["Build Trigger<br/>Automatic build start"]
        DEPENDENCY_INSTALL["Dependency Installation<br/>requirements.txt"]
        MODEL_DOWNLOAD["Model Download<br/>From HF Hub"]
        APP_INITIALIZATION["App Initialization<br/>Gradio app setup"]
    end
    subgraph "Live Demo Space"
        GRADIO_INTERFACE["Gradio Interface<br/>Interactive demo"]
        MODEL_INFERENCE["Model Inference<br/>Real-time ASR"]
        USER_INTERACTION["User Interaction<br/>Audio upload/playback"]
    end
    subgraph "External Services"
        HF_HUB["Hugging Face Hub<br/>Model & Space hosting"]
        HF_SPACES["HF Spaces Platform<br/>Demo hosting"]
    end
    TRAINED_MODEL --> PUSH_SCRIPT
    TRAINING_CONFIG --> PUSH_SCRIPT
    TRAINING_RESULTS --> PUSH_SCRIPT
    MODEL_METADATA --> PUSH_SCRIPT
    PUSH_SCRIPT --> REPO_CREATION
    REPO_CREATION --> FILE_UPLOAD
    FILE_UPLOAD --> METADATA_UPLOAD
    METADATA_UPLOAD --> CARD_SCRIPT
    TRAINING_CONFIG --> CARD_SCRIPT
    TRAINING_RESULTS --> CARD_SCRIPT
    CARD_SCRIPT --> TEMPLATE_LOAD
    TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
    VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
    CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
    METADATA_UPLOAD --> DEPLOY_SCRIPT
    DEPLOY_SCRIPT --> SPACE_CREATION
    SPACE_CREATION --> TEMPLATE_COPY
    TEMPLATE_COPY --> ENV_INJECTION
    ENV_INJECTION --> SECRET_SETUP
    SECRET_SETUP --> BUILD_TRIGGER
    BUILD_TRIGGER --> DEPENDENCY_INSTALL
    DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
    MODEL_DOWNLOAD --> APP_INITIALIZATION
    APP_INITIALIZATION --> GRADIO_INTERFACE
    GRADIO_INTERFACE --> MODEL_INFERENCE
    MODEL_INFERENCE --> USER_INTERACTION
    HF_HUB --> MODEL_DOWNLOAD
    HF_SPACES --> GRADIO_INTERFACE
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
    class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
    class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
    class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
    class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
    class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
    class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
    class HF_HUB,HF_SPACES external
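The publishing and Space setup steps can be sketched as a sequence of Hub API calls. The sketch assumes `api` behaves like `huggingface_hub.HfApi()`, whose `create_repo`, `upload_folder`, `add_space_variable`, and `add_space_secret` methods are used below. The function `publish_and_deploy`, the `DryRunApi` stand-in, and the example paths and token are hypothetical. The actual `push_to_huggingface.py` and `deploy_demo_space.py` scripts may be organised differently.

```python
class DryRunApi:
    """Hypothetical stand-in that records calls instead of contacting the Hub."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(**kwargs):
            self.calls.append((name, kwargs))
        return record


def publish_and_deploy(api, model_dir, space_dir, model_id, space_id, hf_token):
    """Push a trained model, then create and configure a Gradio demo Space."""
    # Model publishing: repository creation, then file and metadata upload.
    api.create_repo(repo_id=model_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=model_dir, repo_id=model_id, repo_type="model")

    # Space setup: create the Space and copy the demo template files into it.
    api.create_repo(repo_id=space_id, repo_type="space", space_sdk="gradio", exist_ok=True)
    api.upload_folder(folder_path=space_dir, repo_id=space_id, repo_type="space")

    # Environment and secrets: public variable for the model, secret for the token.
    api.add_space_variable(repo_id=space_id, key="HF_MODEL_ID", value=model_id)
    api.add_space_secret(repo_id=space_id, key="HF_TOKEN", value=hf_token)


# Usage: a dry run that prints the call order without any network access.
api = DryRunApi()
publish_and_deploy(
    api,
    model_dir="outputs/model",        # hypothetical local model directory
    space_dir="demo_voxtral",         # template directory named in the diagram
    model_id="username/model-name",
    space_id="username/model-name-demo",
    hf_token="hf_placeholder",        # placeholder, not a real token
)
for name, kwargs in api.calls:
    print(name, kwargs["repo_id"])
```

Passing the API object in as a parameter keeps the sequence testable offline. Swapping `DryRunApi()` for a real `HfApi()` instance would perform the uploads.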
Data Flow
Complete data journey through the Voxtral ASR Fine-tuning application from user input to deployed demo.
flowchart TD
    subgraph "User Input"
        MIC["Microphone Recording<br/>Raw audio + timestamps"]
        FILE["File Upload<br/>WAV/FLAC files"]
        TEXT["Manual Transcripts<br/>Text input"]
        LANG["Language Selection<br/>25+ languages"]
    end
    subgraph "Data Processing"
        AUDIO_PROC["Audio Processing<br/>Resampling to 16kHz<br/>Format conversion"]
        TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
        JSONL_CONV["JSONL Conversion<br/>{'audio_path': '...', 'text': '...'}"]
    end
    subgraph "Dataset Storage"
        LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
        HF_DS["HF Hub Dataset<br/>username/dataset-name<br/>Public sharing"]
    end
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT["Train Split<br/>train_dataset"]
        EVAL_SPLIT["Eval Split<br/>eval_dataset"]
    end
    subgraph "Model Training"
        COLLATOR["VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction"]
        FORWARD["Forward Pass<br/>Audio → Features → Text"]
        LOSS["Loss Calculation<br/>Masked LM loss"]
        BACKWARD["Backward Pass<br/>Gradient computation"]
        OPTIMIZE["Parameter Updates<br/>LoRA or full fine-tuning"]
    end
    subgraph "Training Outputs"
        MODEL_FILES["Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json"]
        TRAINING_LOGS["Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves"]
        CHECKPOINTS["Checkpoints<br/>Intermediate models<br/>best model tracking"]
    end
    subgraph "Publishing Pipeline"
        HF_REPO["HF Repository<br/>username/model-name<br/>Model hosting"]
        MODEL_CARD["Model Card<br/>README.md<br/>Training details<br/>Usage examples"]
        METADATA["Training Metadata<br/>Config + results<br/>Performance metrics"]
    end
    subgraph "Demo Deployment"
        SPACE_REPO["HF Space Repository<br/>username/model-name-demo<br/>Demo hosting"]
        DEMO_APP["Demo Application<br/>Gradio interface<br/>Real-time inference"]
        ENV_VARS["Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets"]
    end
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC
    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV
    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS
    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT
    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR
    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE
    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS
    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO
    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD
    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO
    SPACE_REPO --> DEMO_APP
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
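At the end of the data flow, the demo application picks up its model settings from the "Environment Config" node. A minimal sketch of that step follows, assuming the demo reads `HF_MODEL_ID` and `MODEL_NAME` from environment variables. The `DemoConfig` class, the `load_demo_config` function, and the fallback to the repository name are hypothetical and are not confirmed by the project's demo code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoConfig:
    model_id: str    # Hub repository, e.g. "username/model-name"
    model_name: str  # display name shown in the demo UI


def load_demo_config(env=None):
    """Read the demo's model settings from environment variables."""
    env = os.environ if env is None else env
    model_id = env.get("HF_MODEL_ID", "").strip()
    if not model_id:
        raise RuntimeError("HF_MODEL_ID is not set; expected 'username/model-name'")
    # Fall back to the repository name when MODEL_NAME is absent
    # (an assumption of this sketch, not documented behaviour).
    model_name = env.get("MODEL_NAME", "").strip() or model_id.split("/")[-1]
    return DemoConfig(model_id=model_id, model_name=model_name)


# Usage: pass a plain dict to try the function without touching os.environ.
config = load_demo_config({"HF_MODEL_ID": "username/model-name"})
print(config)
```

Failing fast on a missing `HF_MODEL_ID` makes a misconfigured Space show a clear error at startup instead of failing later during model download.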