# Data Flow

```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC[🎤 Microphone Recording<br/>Raw audio + timestamps]
        FILE[📁 File Upload<br/>WAV/FLAC files]
        TEXT[📝 Manual Transcripts<br/>Text input]
        LANG[🌍 Language Selection<br/>25+ languages]
    end

    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
        TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
        JSONL_CONV["JSONL Conversion<br/>{'audio_path': ..., 'text': ...}"]
    end

    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
        HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
    end

    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT[Train Split<br/>train_dataset]
        EVAL_SPLIT[Eval Split<br/>eval_dataset]
    end

    %% Model Training
    subgraph "Model Training"
        COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
        FORWARD[Forward Pass<br/>Audio → Features → Text]
        LOSS[Loss Calculation<br/>Masked LM loss]
        BACKWARD[Backward Pass<br/>Gradient computation]
        OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
    end

    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
        TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
        CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
    end

    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
        MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
        METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
    end

    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
        DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
        ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
    end

    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
        HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
    end

    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC

    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV

    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS

    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER

    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT

    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR

    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE

    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS

    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO

    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD

    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO

    SPACE_REPO --> DEMO_APP

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px

    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```

## Data Flow Overview

This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.

### Data Input Sources

#### User-Generated Data
- **Microphone Recording**: Raw audio captured through browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary

#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from Hugging Face Hub

### Data Processing Pipeline

#### Audio Processing
```python
import librosa
import soundfile as sf

# Load and resample to 16 kHz (librosa returns the waveform and the sampling rate)
audio_data, sr = librosa.load(audio_path, sr=16000)
# Convert to WAV format for consistency
sf.write(output_path, audio_data, sr)
```

#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```

#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append one JSON object per line
with open(jsonl_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

### Dataset Storage

#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl          # Main dataset file
├── recorded_data.jsonl # From recordings
└── wavs/               # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```

#### HF Hub Storage
- **Public Datasets**: Shareable with community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation
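
The push itself can be done with the `datasets` library. A minimal sketch, assuming the JSONL layout above (the repo ID, column names, and rename step are placeholders, not necessarily the project's exact upload code):

```python
from datasets import load_dataset, Audio

# Load the local JSONL produced by the interface (path from the layout above)
ds = load_dataset("json", data_files="datasets/voxtral_user/data.jsonl", split="train")

# Expose the file paths as a decodable audio column, then publish
ds = ds.rename_column("audio_path", "audio")               # assumed column name
ds = ds.cast_column("audio", Audio(sampling_rate=16000))   # decode/resample lazily
ds.push_to_hub("username/dataset-name")                    # requires a write-scoped HF token
```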

### Training Data Pipeline

#### Dataset Loading
```python
from datasets import load_dataset

# Load a local JSONL file (project helper)
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")

# Load a dataset from the Hugging Face Hub
ds = load_dataset("username/dataset-name", split="train")
```

#### Audio Casting
```python
from datasets import Audio

# Ensure a consistent sampling rate (decoding/resampling happens lazily on access)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```

### Training Process Flow

#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training
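
A minimal sketch of these responsibilities, assuming a multimodal processor that accepts raw audio plus text (the class below is illustrative and simplified, not the repository's `VoxtralDataCollator`):

```python
class SimpleASRCollator:
    def __init__(self, processor, prompt_len: int):
        self.processor = processor      # assumed: handles audio features + tokenization
        self.prompt_len = prompt_len    # number of leading audio/prompt tokens to mask

    def __call__(self, examples):
        audios = [ex["audio"]["array"] for ex in examples]
        texts = [ex["text"] for ex in examples]
        batch = self.processor(audio=audios, text=texts, sampling_rate=16000,
                               padding=True, return_tensors="pt")
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100   # ignore padding in the loss
        labels[:, : self.prompt_len] = -100           # ignore the [AUDIO]...<transcribe> prompt
        batch["labels"] = labels
        return batch
```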

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate transcription autoregressively
4. **Loss Calculation**: Compare generated vs target text
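
Schematically, one optimization step over a collated batch looks like the following (in practice the project appears to delegate this loop to the Hugging Face `Trainer`, judging from `trainer_state.json` below; shown here only to make the flow concrete):

```python
# batch: output of the data collator (input features, attention mask, labels)
outputs = model(**batch)   # audio tower -> language model -> token logits
loss = outputs.loss        # masked LM loss over transcript tokens only (labels == -100 ignored)
loss.backward()            # gradient computation
optimizer.step()           # parameter update (LoRA adapters or full weights)
optimizer.zero_grad()
```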

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning rate scheduling
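
For the LoRA path, a typical `peft` setup looks like this; the rank, alpha, and target modules are illustrative defaults, not necessarily the project's exact configuration:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable
```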

### Training Outputs

#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters

#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints

#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model

### Publishing Pipeline

#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```
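
Uploading this structure is a `huggingface_hub` call; a hedged sketch (the repo ID and local output path are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()  # uses HF_TOKEN from the environment or the cached login
api.create_repo("username/model-name", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="outputs/final-model",        # assumed local output directory
    repo_id="username/model-name",
    repo_type="model",
    commit_message="Upload fine-tuned Voxtral model",
)
```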

#### Model Card Generation
- **Template Processing**: Fill model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.
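
A minimal sketch of the fill step, assuming a `string.Template`-style template with named placeholders (the template path, placeholder names, and values below are illustrative):

```python
from string import Template

template = Template(open("templates/model_card.md").read())  # assumed template location
card = template.safe_substitute(
    model_id="username/model-name",
    base_model="mistralai/Voxtral-Mini-3B-2507",
    train_loss="0.42",   # pulled from train_results.json in the real pipeline
)
with open("README.md", "w", encoding="utf-8") as f:
    f.write(card)
```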

### Demo Deployment

#### Space Repository Structure
```
username/model-name-demo/
├── app.py              # Gradio demo application
├── requirements.txt    # Python dependencies
├── README.md           # Space documentation
└── .env                # Environment variables
```
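
The `app.py` in that layout is a small Gradio app; a minimal shape (sketch only, assuming Gradio 4.x, with model loading and branding omitted):

```python
import gradio as gr

def transcribe(audio_path: str) -> str:
    """Placeholder: run the fine-tuned Voxtral model on the recorded/uploaded clip."""
    ...

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="Voxtral ASR Demo",
)

if __name__ == "__main__":
    demo.launch()
```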

#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token  # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```
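
Inside the Space, the demo reads these values from the process environment; a sketch of the corresponding Python side (variable names taken from the block above, defaults are illustrative):

```python
import os

HF_MODEL_ID = os.environ.get("HF_MODEL_ID", "username/model-name")
MODEL_NAME = os.environ.get("MODEL_NAME", "Voxtral Fine-tuned Model")
HF_TOKEN = os.environ.get("HF_TOKEN")  # read-only token injected as a Space secret
```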

### Data Flow Patterns

#### Streaming vs Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency
- **User Input**: Real-time processing with immediate feedback
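
Streaming keeps only a window of examples in memory; a hedged sketch with the `datasets` library (the dataset ID, config, and field name are placeholders, so check the dataset card for the real names):

```python
from itertools import islice
from datasets import load_dataset

# Stream a large external ASR corpus instead of downloading it in full
stream = load_dataset("nvidia/Granary", "en", split="train", streaming=True)
for example in islice(stream, 100):   # inspect only the first 100 examples
    print(example["text"][:80])       # assumed field name; adjust to the dataset schema
```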

#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment
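
A small validation helper capturing those checks might look like this (the thresholds and the 16 kHz expectation are illustrative, mirroring the pipeline described above):

```python
import soundfile as sf

def is_valid_entry(audio_path: str, text: str) -> bool:
    if not text or not text.strip():
        return False                      # empty or whitespace-only transcription
    try:
        info = sf.info(audio_path)        # fails if the file is missing or not audio
    except RuntimeError:
        return False
    if info.samplerate != 16000:
        return False                      # resample upstream before training
    if info.duration < 0.5:
        return False                      # clip too short to align with a transcript
    return True
```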

#### Error Handling
- **Graceful Degradation**: Fallback to local data if external sources fail
- **Retry Logic**: Automatic retry for network failures
- **Logging**: Comprehensive error logging and debugging

### Performance Considerations

#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory

#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing

#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data

### Security & Privacy

#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information

#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces

### Monitoring & Analytics

#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage

#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization

#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)