# Voxtral ASR Fine-tuning Documentation
```mermaid
graph TD
%% Main Entry Point
START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}
%% Documentation Categories
OVERVIEW --> ARCH[🏗️ Architecture Overview]
OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
OVERVIEW --> TRAINING[🚀 Training Pipeline]
OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
OVERVIEW --> DATAFLOW[📊 Data Flow]
%% Architecture Section
ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
ARCH --> ARCH_LINK["📄 View Details → architecture.md"]
click ARCH_LINK "architecture.md"
%% Interface Section
WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
WORKFLOW --> WORKFLOW_LINK["📄 View Details → interface-workflow.md"]
click WORKFLOW_LINK "interface-workflow.md"
%% Training Section
TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
TRAINING --> TRAINING_LINK["📄 View Details → training-pipeline.md"]
click TRAINING_LINK "training-pipeline.md"
%% Deployment Section
DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
DEPLOYMENT --> DEPLOYMENT_LINK["📄 View Details → deployment-pipeline.md"]
click DEPLOYMENT_LINK "deployment-pipeline.md"
%% Data Flow Section
DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
DATAFLOW --> DATAFLOW_LINK["📄 View Details → data-flow.md"]
click DATAFLOW_LINK "data-flow.md"
%% Key Components Highlight
subgraph "🎛️ Core Components"
INTERFACE[interface.py
Gradio Web UI]
TRAIN_SCRIPTS[scripts/train*.py
Training Scripts]
DEPLOY_SCRIPT[scripts/deploy_demo_space.py
Demo Deployment]
PUSH_SCRIPT[scripts/push_to_huggingface.py
Model Publishing]
end
%% Data Flow Highlight
subgraph "📁 Key Data Formats"
JSONL[JSONL Dataset
{"audio_path": "...", "text": "..."}]
HFDATA[HF Hub Models
username/model-name]
SPACES[HF Spaces
Interactive Demos]
end
%% Connect components to their respective docs
INTERFACE --> WORKFLOW
TRAIN_SCRIPTS --> TRAINING
DEPLOY_SCRIPT --> DEPLOYMENT
PUSH_SCRIPT --> DEPLOYMENT
JSONL --> DATAFLOW
HFDATA --> DEPLOYMENT
SPACES --> DEPLOYMENT
%% Styling
classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
class START entry
class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
class JSONL,HFDATA,SPACES data
```
## Voxtral ASR Fine-tuning Application
This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.
### 🎯 What is Voxtral ASR Fine-tuning?
Voxtral is a powerful Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:
- **🎙️ Easy Data Collection**: Record audio or upload files with transcripts
- **🚀 One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **🌐 Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **📊 Experiment Tracking**: Monitor training progress with Trackio integration
### 📚 Documentation Overview
#### 🏗️ [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment
#### 🔄 [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio + transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation
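The dataset-creation step above boils down to writing one `{audio_path, text}` record per line. A minimal sketch of that step (the helper and file names are illustrative, not the actual `interface.py` internals):

```python
# Illustrative sketch of the "Dataset Creation" step: save one recording
# and append a {"audio_path", "text"} record to the JSONL dataset.
# Helper and path names are hypothetical.
import json
import soundfile as sf  # pip install soundfile

def append_example(jsonl_path: str, audio, sample_rate: int, text: str, wav_path: str):
    """Persist one audio clip and register it in the JSONL dataset."""
    sf.write(wav_path, audio, sample_rate)  # write the clip to disk
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"audio_path": wav_path, "text": text}) + "\n")
```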
#### 🚀 [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full) vs `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints
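For the parameter-efficient path, `train_lora.py` wraps the base model with low-rank adapters. A minimal sketch using the `peft` library (the `target_modules` list is an assumption; check the script for the real configuration):

```python
# LoRA setup sketch (assumes the peft library and a recent transformers
# release with Voxtral support; target_modules values are assumed, see
# scripts/train_lora.py for the actual configuration).
from peft import LoraConfig, get_peft_model
from transformers import VoxtralForConditionalGeneration

base_model = VoxtralForConditionalGeneration.from_pretrained(
    "mistralai/Voxtral-Mini-3B-2507"
)
lora_config = LoraConfig(
    r=8,                                  # adapter rank (see config example below)
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```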
#### 🌐 [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface
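The publishing step maps onto a couple of `huggingface_hub` calls. A sketch with a placeholder repo id and output path:

```python
# Sketch of the model-publishing step via huggingface_hub; the repo id
# and local folder path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # resolves HF_TOKEN from the environment or local cache
api.create_repo("username/model-name", exist_ok=True)
api.upload_folder(
    folder_path="outputs/model",   # local training output (assumed path)
    repo_id="username/model-name",
)
```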
#### 📊 [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces
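On the training side, the JSONL file can be loaded and its audio decoded with the `datasets` library. A sketch of that step (the 16 kHz rate is an assumption; it should match the model's feature extractor):

```python
# Sketch of the training-side load: read the JSONL dataset and decode
# audio from the file paths at a fixed sampling rate.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="dataset.jsonl", split="train")
ds = ds.rename_column("audio_path", "audio")            # point the Audio feature at the paths
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0]["audio"]["array"].shape, ds[0]["text"])     # decoded waveform + transcript
```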
### 🛠️ Core Components
| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |
### 📁 Key Data Formats
#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
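A quick way to sanity-check a file in this format before training (standard library only; the file name is a placeholder):

```python
# Validate a JSONL dataset: every line must be a JSON object with an
# existing audio file and a non-empty transcript.
import json
import os

with open("dataset.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        rec = json.loads(line)
        assert os.path.exists(rec["audio_path"]), f"line {i}: missing audio file"
        assert rec["text"].strip(), f"line {i}: empty transcript"
```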
#### Training Configuration
```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
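These fields map naturally onto Hugging Face `TrainingArguments`. A sketch of that wiring (the field names come from the JSON above; the config file name and output directory are hypothetical):

```python
# Map the JSON config above onto TrainingArguments; this is a sketch of
# the idea, not the training scripts' exact wiring.
import json
from transformers import TrainingArguments

with open("train_config.json", encoding="utf-8") as f:  # hypothetical file name
    cfg = json.load(f)

args = TrainingArguments(
    output_dir="outputs/model",                     # assumed output location
    per_device_train_batch_size=cfg["batch_size"],
    learning_rate=cfg["learning_rate"],
    num_train_epochs=cfg["epochs"],
)
```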
#### Model Repository Structure
```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```
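Once published in this layout, the checkpoint can be loaded back for inference (assumes a recent `transformers` release with Voxtral support; the repo id is a placeholder):

```python
# Load a published checkpoint back from the Hub for inference.
from transformers import AutoProcessor, VoxtralForConditionalGeneration

processor = AutoProcessor.from_pretrained("username/model-name")
model = VoxtralForConditionalGeneration.from_pretrained("username/model-name")
```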
### 🚀 Quick Start
1. **Set Environment Variables**:

   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```

2. **Launch Interface**:

   ```bash
   python interface.py
   ```
3. **Follow the Workflow**:
- Select language → Record/upload data → Configure training → Start training
- Monitor progress → View results → Deploy demo
### 📋 Prerequisites
- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets
### 🔧 Configuration Options
#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets
- **HF Hub Datasets**: Community-contributed datasets
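External sources like these can be streamed rather than downloaded in full. A hedged sketch (the `nvidia/Granary` config and split names below are placeholders; check the dataset card for the real ones):

```python
# Stream a Hub dataset as a data source; config/split names are
# placeholders, not verified against nvidia/Granary.
from datasets import load_dataset

granary = load_dataset("nvidia/Granary", "en", split="train", streaming=True)
for example in granary.take(2):  # inspect a couple of records without downloading
    print(example.keys())
```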
#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation
### 📈 Performance & Metrics
#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Exponentiated loss, a measure of model uncertainty
- **Word Error Rate**: ASR accuracy (when reference transcripts are available)
- **Training Time**: Time to convergence
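The two accuracy-oriented metrics above can be computed directly from predictions. A small sketch using the `jiwer` package for WER and the loss-to-perplexity identity:

```python
# WER via jiwer, perplexity as exp(cross-entropy loss).
import math
from jiwer import wer  # pip install jiwer

reference = "hello world"
hypothesis = "hello word"
print("WER:", wer(reference, hypothesis))  # 0.5 -- one of two words is wrong
print("perplexity:", math.exp(2.0))        # from an example loss value of 2.0
```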
#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours/days depending on dataset size
- **Model Size**: Disk space requirements
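Peak GPU memory is easy to read back from PyTorch after a run; a minimal sketch (CUDA only):

```python
# Report peak GPU memory allocated during the current process.
import torch

if torch.cuda.is_available():
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak GPU memory: {peak_gib:.2f} GiB")
```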
### 🤝 Contributing
The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:
- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation
### 📄 Additional Resources
- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)
---
*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*