📋 Documentation Overview
High-level overview of the Voxtral ASR Fine-tuning application and its documentation structure.
graph TD
START(["Voxtral ASR Fine-tuning App"]) --> OVERVIEW{Choose Documentation}
OVERVIEW --> ARCH["Architecture Overview"]
OVERVIEW --> WORKFLOW["Interface Workflow"]
OVERVIEW --> TRAINING["Training Pipeline"]
OVERVIEW --> DEPLOYMENT["Deployment Pipeline"]
OVERVIEW --> DATAFLOW["Data Flow"]
ARCH --> ARCH_DIAG["High-level Architecture
System Components & Layers"]
WORKFLOW --> WORKFLOW_DIAG["User Journey
Recording → Training → Demo"]
TRAINING --> TRAINING_DIAG["Training Scripts
Data → Model → Results"]
DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo
Model → Hub → Space"]
DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey
Input → Processing → Output"]
subgraph "Core Components"
INTERFACE["interface.py
Gradio Web UI"]
TRAIN_SCRIPTS["scripts/train*.py
Training Scripts"]
DEPLOY_SCRIPT["scripts/deploy_demo_space.py
Demo Deployment"]
PUSH_SCRIPT["scripts/push_to_huggingface.py
Model Publishing"]
end
subgraph "Key Data Formats"
JSONL["JSONL Dataset
{'audio_path': '...', 'text': '...'}"]
HFDATA["HF Hub Models
username/model-name"]
SPACES["HF Spaces
Interactive Demos"]
end
INTERFACE --> WORKFLOW
TRAIN_SCRIPTS --> TRAINING
DEPLOY_SCRIPT --> DEPLOYMENT
PUSH_SCRIPT --> DEPLOYMENT
JSONL --> DATAFLOW
HFDATA --> DEPLOYMENT
SPACES --> DEPLOYMENT
classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
class START entry
class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
class JSONL,HFDATA,SPACES data
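The JSONL dataset format named in the overview can be illustrated with a small reader. This is a minimal sketch, assuming one JSON object per line with the two keys shown in the diagram (`audio_path` and `text`); the helper name `parse_jsonl` and the sample paths are hypothetical and not taken from the project.

```python
import json

# The two keys shown in the diagram's JSONL record.
REQUIRED_KEYS = ("audio_path", "text")


def parse_jsonl(lines):
    """Parse JSONL lines into dicts and check that both documented keys exist."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        records.append(record)
    return records


# Hypothetical sample records; real files use double-quoted JSON strings.
sample = [
    '{"audio_path": "wavs/0001.wav", "text": "hello world"}',
    '{"audio_path": "wavs/0002.wav", "text": "good morning"}',
]
records = parse_jsonl(sample)
print(records[0]["text"])
```

Note that the diagram label uses single quotes for readability, while each line of an actual JSONL file has to be valid JSON with double quotes.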
System Architecture
High-level architecture showing the main components and their relationships in the Voxtral ASR Fine-tuning application.
graph TB
subgraph "User Interface"
UI["Gradio Web Interface
interface.py"]
REC["Audio Recording
Microphone Input"]
UP["File Upload
WAV/FLAC files"]
end
subgraph "Data Processing"
DP["Data Processing
Audio resampling
JSONL creation"]
DS["Dataset Management
NVIDIA Granary
Local datasets"]
end
subgraph "Training Pipeline"
TF["Full Fine-tuning
scripts/train.py"]
TL["LoRA Fine-tuning
scripts/train_lora.py"]
TI["Trackio Integration
Experiment Tracking"]
end
subgraph "Model Management"
MM["Model Management
Hugging Face Hub
Local storage"]
MC["Model Card Generation
scripts/generate_model_card.py"]
end
subgraph "Deployment & Demo"
DEP["Demo Space Deployment
scripts/deploy_demo_space.py"]
HF["HF Spaces
Interactive Demo"]
end
subgraph "External Services"
HFH["Hugging Face Hub
Models & Datasets"]
GRAN["NVIDIA Granary
Multilingual ASR Dataset"]
TRACK["Trackio Spaces
Experiment Tracking"]
end
UI --> DP
REC --> DP
UP --> DP
DP --> DS
DS --> TF
DS --> TL
TF --> TI
TL --> TI
TF --> MM
TL --> MM
MM --> MC
MM --> DEP
DEP --> HF
DS -.-> HFH
MM -.-> HFH
TI -.-> TRACK
DS -.-> GRAN
classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
class UI,REC,UP interface
class DP,DS processing
class TF,TL,TI training
class MM,MC management
class DEP,HF deployment
class HFH,GRAN,TRACK external
Interface Workflow
Complete user journey through the Voxtral ASR Fine-tuning interface, from language selection to demo deployment.
flowchart TD
START(["User Opens Interface"]) --> LANG["Language Selection
Choose from 25+ languages"]
LANG --> PHRASES["Load Phrases
From NVIDIA Granary"]
PHRASES --> RECORD["Recording Interface
Display phrases + audio recording"]
RECORD --> |User Records| PROCESS_REC["Process Recordings
Save WAV files + transcripts"]
RECORD --> |Upload Files| PROCESS_UPLOAD["Process Uploads
Handle existing files + transcripts"]
PROCESS_REC --> JSONL["Create JSONL Dataset
{'audio_path': '...', 'text': '...'}"]
PROCESS_UPLOAD --> JSONL
JSONL --> CONFIG["Training Configuration
Model, LoRA/full, hyperparameters"]
CONFIG --> TRAIN["Training Process
Execute train.py or train_lora.py"]
TRAIN --> PUSH["Push to Hub
Model + metadata to HF Hub"]
TRAIN --> CARD["Generate Model Card
Automated documentation"]
PUSH --> DEPLOY["Deploy Demo Space
Interactive demo on HF Spaces"]
DEPLOY --> END(["Demo Ready
Interactive ASR Demo"])
PUSH -.-> END
CARD -.-> END
classDef start fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
classDef process fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef decision fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef terminal fill:#e8f5e8,stroke:#388e3c,stroke-width:3px
class START start
class END terminal
class LANG,PHRASES,RECORD,PROCESS_REC,PROCESS_UPLOAD,JSONL,CONFIG,TRAIN,PUSH,CARD,DEPLOY process
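The "Process Recordings" and "Create JSONL Dataset" steps above could look roughly like the following. This is a sketch under stated assumptions, not the code in interface.py: the helper names `save_example` and `write_jsonl`, the file naming scheme, and the 16-bit mono format are hypothetical, and only the standard library is used.

```python
import json
import tempfile
import wave
from pathlib import Path

SAMPLE_RATE = 16_000  # the diagrams mention 16 kHz audio


def save_example(dataset_dir, index, pcm_bytes, text):
    """Write one mono 16-bit WAV file and return its JSONL record."""
    wav_dir = Path(dataset_dir) / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    wav_path = wav_dir / f"{index:04d}.wav"
    with wave.open(str(wav_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm_bytes)
    return {"audio_path": str(wav_path), "text": text}


def write_jsonl(dataset_dir, records):
    """Write one JSON object per line to data.jsonl in the dataset directory."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    jsonl_path = dataset_dir / "data.jsonl"
    with jsonl_path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return jsonl_path


# Usage: one second of silence stands in for a real recording.
dataset_dir = Path(tempfile.mkdtemp()) / "voxtral_user"
silence = b"\x00\x00" * SAMPLE_RATE
records = [save_example(dataset_dir, 1, silence, "hello world")]
jsonl_path = write_jsonl(dataset_dir, records)
```

Writing with `ensure_ascii=False` keeps non-English transcripts readable in the file, which matters for a multilingual dataset.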
Training Pipeline
Detailed training pipeline showing how data flows through training scripts and supporting infrastructure.
graph TB
subgraph "Data Sources"
JSONL["JSONL Dataset
{'audio_path': '...', 'text': '...'}"]
GRANARY["NVIDIA Granary Dataset
Multilingual ASR Data"]
HFDATA["HF Hub Datasets
Community Datasets"]
end
subgraph "Data Processing"
LOADER["Dataset Loader
_load_jsonl_dataset()"]
CASTER["Audio Casting
16kHz resampling"]
COLLATOR["VoxtralDataCollator
Audio + Text Processing"]
end
subgraph "Training Scripts"
TRAIN_FULL["Full Fine-tuning
scripts/train.py"]
TRAIN_LORA["LoRA Fine-tuning
scripts/train_lora.py"]
subgraph "Training Components"
MODEL_INIT["Model Initialization
VoxtralForConditionalGeneration"]
LORA_CONFIG["LoRA Configuration
LoraConfig + get_peft_model"]
PROCESSOR_INIT["Processor Initialization
VoxtralProcessor"]
end
end
subgraph "Training Infrastructure"
TRACKIO_INIT["Trackio Integration
Experiment Tracking"]
HF_TRAINER["Hugging Face Trainer
TrainingArguments + Trainer"]
TORCH_DEVICE["Torch Device Setup
GPU/CPU Detection"]
end
subgraph "Training Process"
FORWARD_PASS["Forward Pass
Audio Processing + Generation"]
LOSS_CALC["Loss Calculation
Causal LM cross-entropy (masked labels)"]
BACKWARD_PASS["Backward Pass
Gradient Computation"]
OPTIMIZER_STEP["Optimizer Step
Parameter Updates"]
LOGGING["Metrics Logging
Loss, Perplexity, etc."]
end
subgraph "Model Management"
CHECKPOINT_SAVING["Checkpoint Saving
Model snapshots"]
MODEL_SAVING["Final Model Saving
Processor + Model"]
LOCAL_STORAGE["Local Storage
outputs/ directory"]
end
LOADER --> CASTER
CASTER --> COLLATOR
COLLATOR --> TRAIN_FULL
COLLATOR --> TRAIN_LORA
TRAIN_FULL --> MODEL_INIT
TRAIN_LORA --> MODEL_INIT
TRAIN_LORA --> LORA_CONFIG
MODEL_INIT --> PROCESSOR_INIT
LORA_CONFIG --> PROCESSOR_INIT
PROCESSOR_INIT --> TRACKIO_INIT
PROCESSOR_INIT --> HF_TRAINER
PROCESSOR_INIT --> TORCH_DEVICE
TRACKIO_INIT --> HF_TRAINER
TORCH_DEVICE --> HF_TRAINER
HF_TRAINER --> FORWARD_PASS
FORWARD_PASS --> LOSS_CALC
LOSS_CALC --> BACKWARD_PASS
BACKWARD_PASS --> OPTIMIZER_STEP
OPTIMIZER_STEP --> LOGGING
LOGGING --> CHECKPOINT_SAVING
LOGGING --> TRACKIO_INIT
HF_TRAINER --> MODEL_SAVING
MODEL_SAVING --> LOCAL_STORAGE
JSONL --> LOADER
GRANARY --> LOADER
HFDATA --> LOADER
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px
class JSONL,GRANARY,HFDATA input
class LOADER,CASTER,COLLATOR processing
class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
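The "Metrics Logging" node lists loss and perplexity together. The two are related by a simple formula, sketched below on the assumption that the logged loss is a mean per-token cross-entropy measured in nats; the function names are hypothetical and not taken from the training scripts.

```python
import math


def mean_loss(token_losses):
    """Average per-token cross-entropy values (in nats)."""
    return sum(token_losses) / len(token_losses)


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(mean_cross_entropy)


# Usage with made-up per-token losses, for illustration only.
loss = mean_loss([math.log(2), math.log(8)])
print(f"loss={loss:.4f} perplexity={perplexity(loss):.4f}")
```

A perplexity of 1 corresponds to a loss of 0, so a falling loss curve and a falling perplexity curve carry the same information on different scales.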
Deployment Pipeline
Model publishing and demo deployment process, from a trained model to a live interactive demo.
graph TB
subgraph "Inputs"
TRAINED_MODEL["Trained Model
Local directory"]
TRAINING_CONFIG["Training Config
JSON/YAML"]
TRAINING_RESULTS["Training Results
Metrics & logs"]
MODEL_METADATA["Model Metadata
Name, description, etc."]
end
subgraph "Model Publishing"
PUSH_SCRIPT["push_to_huggingface.py
Model Publisher"]
subgraph "Publishing Steps"
REPO_CREATION["Repository Creation
HF Hub API"]
FILE_UPLOAD["File Upload
Model files to HF"]
METADATA_UPLOAD["Metadata Upload
Config & results"]
end
end
subgraph "Model Card Generation"
CARD_SCRIPT["generate_model_card.py
Card Generator"]
subgraph "Card Components"
TEMPLATE_LOAD["Template Loading
model_card.md"]
VARIABLE_REPLACEMENT["Variable Replacement
Config injection"]
CONDITIONAL_PROCESSING["Conditional Sections
Quantized models, etc."]
end
end
subgraph "Demo Space Deployment"
DEPLOY_SCRIPT["deploy_demo_space.py
Space Deployer"]
subgraph "Space Setup"
SPACE_CREATION["Space Repository
Create HF Space"]
TEMPLATE_COPY["Template Copying
demo_voxtral/ files"]
ENV_INJECTION["Environment Setup
Model config injection"]
SECRET_SETUP["Secret Configuration
HF_TOKEN, model vars"]
end
end
subgraph "Space Building"
BUILD_TRIGGER["Build Trigger
Automatic build start"]
DEPENDENCY_INSTALL["Dependency Installation
requirements.txt"]
MODEL_DOWNLOAD["Model Download
From HF Hub"]
APP_INITIALIZATION["App Initialization
Gradio app setup"]
end
subgraph "Live Demo Space"
GRADIO_INTERFACE["Gradio Interface
Interactive demo"]
MODEL_INFERENCE["Model Inference
Real-time ASR"]
USER_INTERACTION["User Interaction
Audio upload/playback"]
end
subgraph "External Services"
HF_HUB["Hugging Face Hub
Model & Space hosting"]
HF_SPACES["HF Spaces Platform
Demo hosting"]
end
TRAINED_MODEL --> PUSH_SCRIPT
TRAINING_CONFIG --> PUSH_SCRIPT
TRAINING_RESULTS --> PUSH_SCRIPT
MODEL_METADATA --> PUSH_SCRIPT
PUSH_SCRIPT --> REPO_CREATION
REPO_CREATION --> FILE_UPLOAD
FILE_UPLOAD --> METADATA_UPLOAD
METADATA_UPLOAD --> CARD_SCRIPT
TRAINING_CONFIG --> CARD_SCRIPT
TRAINING_RESULTS --> CARD_SCRIPT
CARD_SCRIPT --> TEMPLATE_LOAD
TEMPLATE_LOAD --> VARIABLE_REPLACEMENT
VARIABLE_REPLACEMENT --> CONDITIONAL_PROCESSING
CONDITIONAL_PROCESSING --> DEPLOY_SCRIPT
METADATA_UPLOAD --> DEPLOY_SCRIPT
DEPLOY_SCRIPT --> SPACE_CREATION
SPACE_CREATION --> TEMPLATE_COPY
TEMPLATE_COPY --> ENV_INJECTION
ENV_INJECTION --> SECRET_SETUP
SECRET_SETUP --> BUILD_TRIGGER
BUILD_TRIGGER --> DEPENDENCY_INSTALL
DEPENDENCY_INSTALL --> MODEL_DOWNLOAD
MODEL_DOWNLOAD --> APP_INITIALIZATION
APP_INITIALIZATION --> GRADIO_INTERFACE
GRADIO_INTERFACE --> MODEL_INFERENCE
MODEL_INFERENCE --> USER_INTERACTION
HF_HUB --> MODEL_DOWNLOAD
HF_SPACES --> GRADIO_INTERFACE
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef publishing fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef generation fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef deployment fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef building fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef demo fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px
class TRAINED_MODEL,TRAINING_CONFIG,TRAINING_RESULTS,MODEL_METADATA input
class PUSH_SCRIPT,REPO_CREATION,FILE_UPLOAD,METADATA_UPLOAD publishing
class CARD_SCRIPT,TEMPLATE_LOAD,VARIABLE_REPLACEMENT,CONDITIONAL_PROCESSING generation
class DEPLOY_SCRIPT,SPACE_CREATION,TEMPLATE_COPY,ENV_INJECTION,SECRET_SETUP deployment
class BUILD_TRIGGER,DEPENDENCY_INSTALL,MODEL_DOWNLOAD,APP_INITIALIZATION building
class GRADIO_INTERFACE,MODEL_INFERENCE,USER_INTERACTION demo
class HF_HUB,HF_SPACES external
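The card generation steps (template loading, variable replacement, conditional sections) can be sketched as a small renderer. The placeholder syntax here (`{{name}}` and `{{#if flag}}...{{/if}}`) is an assumption chosen for illustration; the actual template format used by generate_model_card.py is not shown in this document, and `render_card` is a hypothetical name.

```python
import re

# Hypothetical template syntax, not confirmed by the project.
IF_BLOCK = re.compile(r"\{\{#if (\w+)\}\}(.*?)\{\{/if\}\}", re.DOTALL)
VARIABLE = re.compile(r"\{\{(\w+)\}\}")


def render_card(template, variables):
    """Resolve conditional sections first, then substitute variables."""

    def keep_or_drop(match):
        # Keep the section body only when its flag is truthy.
        return match.group(2) if variables.get(match.group(1)) else ""

    def substitute(match):
        # Leave unknown placeholders untouched so gaps stay visible.
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    text = IF_BLOCK.sub(keep_or_drop, template)
    return VARIABLE.sub(substitute, text)


TEMPLATE = (
    "# {{model_name}}\n"
    "Base model: {{base_model}}\n"
    "{{#if quantized}}This checkpoint is quantized.\n{{/if}}"
)

card = render_card(
    TEMPLATE,
    {"model_name": "username/model-name", "base_model": "base-model-id", "quantized": False},
)
print(card)
```

Resolving conditionals before variables means a dropped section never needs its placeholders filled, which keeps the variable dictionary small for simple cards.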
Data Flow
Complete data journey through the Voxtral ASR Fine-tuning application from user input to deployed demo.
flowchart TD
subgraph "User Input"
MIC["Microphone Recording
Raw audio + timestamps"]
FILE["File Upload
WAV/FLAC files"]
TEXT["Manual Transcripts
Text input"]
LANG["Language Selection
25+ languages"]
end
subgraph "Data Processing"
AUDIO_PROC["Audio Processing
Resampling to 16kHz
Format conversion"]
TEXT_PROC["Text Processing
Transcript validation
Cleaning & formatting"]
JSONL_CONV["JSONL Conversion
{'audio_path': '...', 'text': '...'}"]
end
subgraph "Dataset Storage"
LOCAL_DS["Local Dataset
datasets/voxtral_user/
data.jsonl + wavs/"]
HF_DS["HF Hub Dataset
username/dataset-name
Public sharing"]
end
subgraph "Training Data Pipeline"
DS_LOADER["Dataset Loader
_load_jsonl_dataset()
or load_dataset()"]
AUDIO_CAST["Audio Casting
Audio(sampling_rate=16000)"]
TRAIN_SPLIT["Train Split
train_dataset"]
EVAL_SPLIT["Eval Split
eval_dataset"]
end
subgraph "Model Training"
COLLATOR["VoxtralDataCollator
Audio + Text batching
Prompt construction"]
FORWARD["Forward Pass
Audio → Features → Text"]
LOSS["Loss Calculation
Causal LM loss (masked labels)"]
BACKWARD["Backward Pass
Gradient computation"]
OPTIMIZE["Parameter Updates
LoRA or full fine-tuning"]
end
subgraph "Training Outputs"
MODEL_FILES["Model Files
model.safetensors
config.json
tokenizer.json"]
TRAINING_LOGS["Training Logs
train_results.json
training_config.json
loss curves"]
CHECKPOINTS["Checkpoints
Intermediate models
best model tracking"]
end
subgraph "Publishing Pipeline"
HF_REPO["HF Repository
username/model-name
Model hosting"]
MODEL_CARD["Model Card
README.md
Training details
Usage examples"]
METADATA["Training Metadata
Config + results
Performance metrics"]
end
subgraph "Demo Deployment"
SPACE_REPO["HF Space Repository
username/model-name-demo
Demo hosting"]
DEMO_APP["Demo Application
Gradio interface
Real-time inference"]
ENV_VARS["Environment Config
HF_MODEL_ID
MODEL_NAME
secrets"]
end
MIC --> AUDIO_PROC
FILE --> AUDIO_PROC
TEXT --> TEXT_PROC
LANG --> TEXT_PROC
AUDIO_PROC --> JSONL_CONV
TEXT_PROC --> JSONL_CONV
JSONL_CONV --> LOCAL_DS
LOCAL_DS --> HF_DS
LOCAL_DS --> DS_LOADER
HF_DS --> DS_LOADER
DS_LOADER --> AUDIO_CAST
AUDIO_CAST --> TRAIN_SPLIT
AUDIO_CAST --> EVAL_SPLIT
TRAIN_SPLIT --> COLLATOR
EVAL_SPLIT --> COLLATOR
COLLATOR --> FORWARD
FORWARD --> LOSS
LOSS --> BACKWARD
BACKWARD --> OPTIMIZE
OPTIMIZE --> MODEL_FILES
OPTIMIZE --> TRAINING_LOGS
OPTIMIZE --> CHECKPOINTS
MODEL_FILES --> HF_REPO
TRAINING_LOGS --> HF_REPO
CHECKPOINTS --> HF_REPO
HF_REPO --> MODEL_CARD
TRAINING_LOGS --> MODEL_CARD
MODEL_CARD --> SPACE_REPO
HF_REPO --> SPACE_REPO
ENV_VARS --> SPACE_REPO
SPACE_REPO --> DEMO_APP
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
class MIC,FILE,TEXT,LANG input
class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
class LOCAL_DS,HF_DS storage
class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
class HF_REPO,MODEL_CARD,METADATA publishing
class SPACE_REPO,DEMO_APP,ENV_VARS deployment
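The "Environment Config" node names two variables, `HF_MODEL_ID` and `MODEL_NAME`, and the Space repository follows a `username/model-name-demo` pattern. A demo app might read and derive these as sketched below. The function names, the fallback for a missing `MODEL_NAME`, and the error handling are assumptions, not the behavior of the deployed demo.

```python
import os


def load_demo_config(environ=os.environ):
    """Read the model settings a demo Space would need from its environment."""
    model_id = environ.get("HF_MODEL_ID")
    if not model_id:
        raise RuntimeError("HF_MODEL_ID is not set")
    # Assumed fallback: use the repository name when MODEL_NAME is absent.
    model_name = environ.get("MODEL_NAME") or model_id.split("/")[-1]
    return {"model_id": model_id, "model_name": model_name}


def demo_space_id(model_id):
    """Derive the Space id using the '-demo' suffix shown in the diagram."""
    return f"{model_id}-demo"


# Usage with an explicit mapping instead of the real process environment.
config = load_demo_config({"HF_MODEL_ID": "username/model-name"})
print(config["model_name"], demo_space_id(config["model_id"]))
```

Passing the environment as a parameter keeps the function easy to test without touching real secrets such as `HF_TOKEN`.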