# Voxtral ASR Fine-tuning Documentation
```mermaid
graph TD
%% Main Entry Point
START([Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}
%% Documentation Categories
OVERVIEW --> ARCH[Architecture Overview]
OVERVIEW --> WORKFLOW[Interface Workflow]
OVERVIEW --> TRAINING[Training Pipeline]
OVERVIEW --> DEPLOYMENT[Deployment Pipeline]
OVERVIEW --> DATAFLOW[Data Flow]
%% Architecture Section
ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
ARCH --> ARCH_LINK[View Details]
click ARCH_LINK "architecture.md"
%% Interface Section
WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
WORKFLOW --> WORKFLOW_LINK[View Details]
click WORKFLOW_LINK "interface-workflow.md"
%% Training Section
TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
TRAINING --> TRAINING_LINK[View Details]
click TRAINING_LINK "training-pipeline.md"
%% Deployment Section
DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
DEPLOYMENT --> DEPLOYMENT_LINK[View Details]
click DEPLOYMENT_LINK "deployment-pipeline.md"
%% Data Flow Section
DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
DATAFLOW --> DATAFLOW_LINK[View Details]
click DATAFLOW_LINK "data-flow.md"
%% Key Components Highlight
subgraph "Core Components"
INTERFACE["interface.py<br/>Gradio Web UI"]
TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
DEPLOY_SCRIPT["scripts/deploy_demo_space.py<br/>Demo Deployment"]
PUSH_SCRIPT["scripts/push_to_huggingface.py<br/>Model Publishing"]
end
%% Data Flow Highlight
subgraph "Key Data Formats"
JSONL["JSONL Dataset<br/>{#quot;audio_path#quot;: ..., #quot;text#quot;: ...}"]
HFDATA["HF Hub Models<br/>username/model-name"]
SPACES["HF Spaces<br/>Interactive Demos"]
end
%% Connect components to their respective docs
INTERFACE --> WORKFLOW
TRAIN_SCRIPTS --> TRAINING
DEPLOY_SCRIPT --> DEPLOYMENT
PUSH_SCRIPT --> DEPLOYMENT
JSONL --> DATAFLOW
HFDATA --> DEPLOYMENT
SPACES --> DEPLOYMENT
%% Styling
classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
class START entry
class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
class JSONL,HFDATA,SPACES data
```
## Voxtral ASR Fine-tuning Application
This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.
### What is Voxtral ASR Fine-tuning?
Voxtral is a powerful Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:
- **Easy Data Collection**: Record audio or upload files with transcripts
- **One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **Experiment Tracking**: Monitor training progress with Trackio integration
### Documentation Overview
#### [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment
#### [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio + transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation
#### [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full) vs `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management (logging sketched below)
- **Model Outputs**: Trained models, training logs, checkpoints
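
As a rough sketch of the Trackio integration mentioned above, the library exposes a wandb-style API. The project name and metric values here are illustrative placeholders, not taken from the training scripts:

```python
import trackio

trackio.init(project="voxtral-asr-finetune")  # hypothetical project name

# Stand-in for the real training loop in scripts/train*.py.
for step, loss in enumerate([2.31, 1.87, 1.42]):
    trackio.log({"train/loss": loss, "step": step})

trackio.finish()
```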
#### [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces (sketched below)
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface
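
A condensed sketch of what deploying a Space involves, using `huggingface_hub` directly; the repo ID, local folder, and secret value are placeholders, and the actual logic lives in `scripts/deploy_demo_space.py`:

```python
from huggingface_hub import HfApi

api = HfApi()  # uses HF_TOKEN from the environment

# Create (or reuse) a Gradio Space and upload the demo app files.
api.create_repo("username/voxtral-demo", repo_type="space",
                space_sdk="gradio", exist_ok=True)
api.upload_folder(repo_id="username/voxtral-demo", repo_type="space",
                  folder_path="demo/")  # hypothetical local demo folder

# Secrets (e.g. HF_TOKEN) are configured per Space rather than committed.
api.add_space_secret("username/voxtral-demo", "HF_TOKEN", "hf_...")
```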
#### [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion (resampling sketched below)
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces
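
The resampling step can be illustrated with `torchaudio`; the 16 kHz target is an assumption about the model's input rate (check the Voxtral feature extractor for the actual value):

```python
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("path/to/audio.wav")
if sr != 16_000:  # assumed target rate, not read from the scripts
    waveform = F.resample(waveform, orig_freq=sr, new_freq=16_000)
waveform = waveform.mean(dim=0, keepdim=True)  # downmix stereo to mono
```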
### Core Components
| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |
### Key Data Formats
#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
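
A minimal sketch of producing this format, assuming in-memory pairs of audio paths and transcripts (the file names and texts here are placeholders):

```python
import json

# Placeholder samples; the interface assembles these from recordings/uploads.
samples = [
    {"audio_path": "clips/0001.wav", "text": "hello world"},
    {"audio_path": "clips/0002.wav", "text": "voxtral fine-tuning"},
]

# One JSON object per line, matching the format shown above.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```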
#### Training Configuration
```json
{
"model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
"batch_size": 2,
"learning_rate": 5e-5,
"epochs": 3,
"lora_r": 8,
"lora_alpha": 32
}
```
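
For orientation, here is how the `lora_r` and `lora_alpha` values above would map onto a `peft.LoraConfig`; the `target_modules` list and dropout are assumptions for illustration, not read from `train_lora.py`:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                                  # matches "lora_r" above
    lora_alpha=32,                        # matches "lora_alpha" above
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,                    # assumed; not in the config above
    task_type="CAUSAL_LM",
)
```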
#### Model Repository Structure
```
username/model-name/
βββ model.safetensors
βββ config.json
βββ tokenizer.json
βββ README.md (model card)
βββ training_results/
```
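
To check that a published repository matches this layout, `huggingface_hub` can list its files; `username/model-name` is the same placeholder ID used above:

```python
from huggingface_hub import list_repo_files

# Expect model.safetensors, config.json, tokenizer.json, README.md, ...
for path in list_repo_files("username/model-name"):
    print(path)
```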
### Quick Start
1. **Set Environment Variables**:
```bash
export HF_TOKEN=your_huggingface_token
export HF_USERNAME=your_username
```
2. **Launch Interface**:
```bash
python interface.py
```
3. **Follow the Workflow**:
- Select language → Record/upload data → Configure training → Start training
- Monitor progress → View results → Deploy demo
### Prerequisites
- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets
### Configuration Options
#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets
- **HF Hub Datasets**: Community-contributed datasets
#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation
### Performance & Metrics
#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Exponential of the mean cross-entropy loss; lower is better (see the note below)
- **Word Error Rate**: ASR accuracy (if available)
- **Training Time**: Time to convergence
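
Since perplexity is the exponential of the mean cross-entropy loss, it can be read directly off the loss curve:

```python
import math

mean_ce_loss = 1.42                   # example value from a loss curve
perplexity = math.exp(mean_ce_loss)   # ≈ 4.14
```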
#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours/days depending on dataset size
- **Model Size**: Disk space requirements
### Contributing
The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:
- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation
### Additional Resources
- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)
---
*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*