# Voxtral ASR Fine-tuning Documentation

```mermaid
graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[πŸ—οΈ Architecture Overview]
    OVERVIEW --> WORKFLOW[πŸ”„ Interface Workflow]
    OVERVIEW --> TRAINING[πŸš€ Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[πŸ“Š Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK["πŸ“„ View Details"]
    click ARCH_LINK "architecture.md"

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK["πŸ“„ View Details"]
    click WORKFLOW_LINK "interface-workflow.md"

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK["πŸ“„ View Details"]
    click TRAINING_LINK "training-pipeline.md"

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK["πŸ“„ View Details"]
    click DEPLOYMENT_LINK "deployment-pipeline.md"

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK["πŸ“„ View Details"]
    click DATAFLOW_LINK "data-flow.md"

    %% Key Components Highlight
    subgraph "πŸŽ›οΈ Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "πŸ“ Key Data Formats"
        JSONL["JSONL Dataset<br/>{audio_path, text}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```

## Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

### 🎯 What is Voxtral ASR Fine-tuning?

Voxtral is Mistral AI's Automatic Speech Recognition (ASR) model family, which can be fine-tuned for specific tasks and languages. This application provides:

- **πŸŽ™οΈ Easy Data Collection**: Record audio or upload files with transcripts
- **πŸš€ One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **🌐 Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **πŸ“Š Experiment Tracking**: Monitor training progress with Trackio integration

### πŸ“š Documentation Overview

#### πŸ—οΈ [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment

#### πŸ”„ [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files
- **Dataset Creation**: Process audio + transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation

#### πŸš€ [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation (resampling sketch after this list)
- **Training Scripts**: `train.py` (full) vs `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints
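
The data processing step above includes audio resampling. Below is a minimal sketch of that operation using torchaudio; the 16 kHz target is an assumption (it is the standard rate for ASR encoders), and the actual training scripts may implement this differently:

```python
import torchaudio
import torchaudio.functional as F

TARGET_SR = 16_000  # assumed target rate; standard for ASR encoders

def load_resampled(path: str):
    """Load an audio file, downmix to mono, and resample to TARGET_SR."""
    waveform, sr = torchaudio.load(path)  # -> (channels, samples), sample rate
    if waveform.size(0) > 1:              # downmix multi-channel audio
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != TARGET_SR:
        waveform = F.resample(waveform, sr, TARGET_SR)
    return waveform, TARGET_SR
```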

#### 🌐 [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface

#### πŸ“Š [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces

### πŸ› οΈ Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |

### πŸ“ Key Data Formats

#### JSONL Dataset Format
```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
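
A minimal sketch of producing and reading this format with the standard library (file and path names are illustrative):

```python
import json

# Two illustrative records in the dataset format shown above.
records = [
    {"audio_path": "recordings/clip_000.wav", "text": "hello world"},
    {"audio_path": "recordings/clip_001.wav", "text": "testing one two three"},
]

# Write: exactly one JSON object per line.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back, skipping any blank lines.
with open("dataset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

assert loaded == records
```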

#### Training Configuration
```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
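
The `lora_r` and `lora_alpha` fields correspond to peft's `LoraConfig` parameters. A hedged sketch of how such a config could be consumed; the `target_modules` list and the config file name are assumptions, not taken from the training scripts:

```python
import json

from peft import LoraConfig, get_peft_model
from transformers import AutoModel  # stand-in; the scripts load the actual Voxtral class

with open("train_config.json") as f:  # illustrative file name
    cfg = json.load(f)

lora_cfg = LoraConfig(
    r=cfg["lora_r"],                      # rank of the low-rank update matrices
    lora_alpha=cfg["lora_alpha"],         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)

model = AutoModel.from_pretrained(cfg["model_checkpoint"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # shows the parameter-efficiency win
```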

#### Model Repository Structure
```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```
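
Once published, the whole repository can be fetched as a unit with `huggingface_hub` (the repo id is a placeholder):

```python
from huggingface_hub import snapshot_download

# Fetch every file in the published repo (placeholder id) into the local
# HF cache and return the directory it was materialized in.
local_dir = snapshot_download(repo_id="username/model-name")
print(local_dir)
```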

### πŸš€ Quick Start

1. **Set Environment Variables**:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```

2. **Launch Interface**:
   ```bash
   python interface.py
   ```

3. **Follow the Workflow**:
   - Select language → Record/upload data → Configure training → Start training
   - Monitor progress → View results → Deploy demo
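
A minimal pre-flight check for step 1, assuming the `huggingface_hub` client is installed:

```python
import os

from huggingface_hub import login, whoami

token = os.environ["HF_TOKEN"]  # raises KeyError if step 1 was skipped
login(token=token)              # authenticate this session against the Hub
print("Authenticated as:", whoami()["name"])
print("HF_USERNAME set to:", os.environ.get("HF_USERNAME"))
```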

### πŸ“‹ Prerequisites

- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets
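
A quick way to confirm the GPU prerequisite before starting a run (assumes PyTorch is installed):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device found; training on CPU will be slow or infeasible.")
```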

### πŸ”§ Configuration Options

#### Training Modes
- **LoRA Fine-tuning**: Efficient, fast, lower memory usage
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements

#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets (loading sketch after this list)
- **HF Hub Datasets**: Community-contributed datasets
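
External sources such as Granary can be pulled in with the `datasets` library, as in this hedged sketch; the config name `"en"` and the `"train"` split are assumptions, so consult the dataset card for the real names:

```python
from datasets import load_dataset

# Stream rather than download the full corpus. Config/split names here
# are assumptions; check the dataset card.
granary = load_dataset("nvidia/Granary", "en", split="train", streaming=True)
print(next(iter(granary)).keys())
```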

#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos
- **Model Cards**: Automated documentation

### πŸ“ˆ Performance & Metrics

#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Model confidence measure
- **Word Error Rate**: ASR accuracy (if available; see the sketch after this list)
- **Training Time**: Time to convergence
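
Word Error Rate can be computed with, for example, the `jiwer` package (not necessarily what the training scripts use); a minimal sketch on made-up strings:

```python
from jiwer import wer

reference = "the quick brown fox"
hypothesis = "the quick brown box"  # one substitution out of four words
print(f"WER: {wer(reference, hypothesis):.2f}")  # 0.25
```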

#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours/days depending on dataset size
- **Model Size**: Disk space requirements

### 🀝 Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation

### πŸ“„ Additional Resources

- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)

---

*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*