# Data Flow
```mermaid
flowchart TD
%% User Input Sources
subgraph "User Input"
MIC[🎤 Microphone Recording<br/>Raw audio + timestamps]
FILE[📁 File Upload<br/>WAV/FLAC files]
TEXT[📝 Manual Transcripts<br/>Text input]
LANG[🌍 Language Selection<br/>25+ languages]
end
%% Data Processing Pipeline
subgraph "Data Processing"
AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
JSONL_CONV["JSONL Conversion<br/>{'audio_path': ..., 'text': ...}"]
end
%% Dataset Storage
subgraph "Dataset Storage"
LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
end
%% Training Data Flow
subgraph "Training Data Pipeline"
DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
TRAIN_SPLIT[Train Split<br/>train_dataset]
EVAL_SPLIT[Eval Split<br/>eval_dataset]
end
%% Model Training
subgraph "Model Training"
COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
FORWARD["Forward Pass<br/>Audio → Features → Text"]
LOSS[Loss Calculation<br/>Masked LM loss]
BACKWARD[Backward Pass<br/>Gradient computation]
OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
end
%% Training Outputs
subgraph "Training Outputs"
MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
end
%% Publishing Pipeline
subgraph "Publishing Pipeline"
HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
end
%% Demo Deployment
subgraph "Demo Deployment"
SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
end
%% External Data Sources
subgraph "External Data Sources"
GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
end
%% Data Flow Connections
MIC --> AUDIO_PROC
FILE --> AUDIO_PROC
TEXT --> TEXT_PROC
LANG --> TEXT_PROC
AUDIO_PROC --> JSONL_CONV
TEXT_PROC --> JSONL_CONV
JSONL_CONV --> LOCAL_DS
LOCAL_DS --> HF_DS
LOCAL_DS --> DS_LOADER
HF_DS --> DS_LOADER
GRANARY --> DS_LOADER
HF_COMM --> DS_LOADER
DS_LOADER --> AUDIO_CAST
AUDIO_CAST --> TRAIN_SPLIT
AUDIO_CAST --> EVAL_SPLIT
TRAIN_SPLIT --> COLLATOR
EVAL_SPLIT --> COLLATOR
COLLATOR --> FORWARD
FORWARD --> LOSS
LOSS --> BACKWARD
BACKWARD --> OPTIMIZE
OPTIMIZE --> MODEL_FILES
OPTIMIZE --> TRAINING_LOGS
OPTIMIZE --> CHECKPOINTS
MODEL_FILES --> HF_REPO
TRAINING_LOGS --> HF_REPO
CHECKPOINTS --> HF_REPO
HF_REPO --> MODEL_CARD
TRAINING_LOGS --> MODEL_CARD
MODEL_CARD --> SPACE_REPO
HF_REPO --> SPACE_REPO
ENV_VARS --> SPACE_REPO
SPACE_REPO --> DEMO_APP
%% Styling
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px
class MIC,FILE,TEXT,LANG input
class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
class LOCAL_DS,HF_DS storage
class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
class HF_REPO,MODEL_CARD,METADATA publishing
class SPACE_REPO,DEMO_APP,ENV_VARS deployment
class GRANARY,HF_COMM external
```
## Data Flow Overview
This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.
### Data Input Sources
#### User-Generated Data
- **Microphone Recording**: Raw audio captured through browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary
#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from Hugging Face Hub
### Data Processing Pipeline
#### Audio Processing
```python
import librosa
import soundfile as sf

# Resample to 16 kHz; librosa.load returns (samples, sample_rate)
audio_data, sr = librosa.load(audio_path, sr=16000)
# Convert to WAV format for consistency
sf.write(output_path, audio_data, sr)
```
#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```
#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append to the JSONL file, one record per line
with open(jsonl_path, "a") as f:
    f.write(json.dumps(entry) + "\n")
```
### Dataset Storage
#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl            # Main dataset file
├── recorded_data.jsonl   # From recordings
└── wavs/                 # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```
#### HF Hub Storage
- **Public Datasets**: Shareable with community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation
### Training Data Pipeline
#### Dataset Loading
```python
# Load local JSONL
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")
# Load HF dataset
ds = load_dataset("username/dataset-name", split="train")
```
#### Audio Casting
```python
# Ensure consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```
#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```
### Training Process Flow
#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training
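The masking step can be sketched with plain token lists (the token IDs below are illustrative placeholders, not the real Voxtral vocabulary):

```python
# Sketch of label masking in a Voxtral-style collator (illustrative token IDs).
IGNORE_INDEX = -100  # positions with this label are excluded from the LM loss

def build_labels(prompt_ids, target_ids):
    """Mask the audio/prompt tokens so loss is computed only on the transcript."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels

# Hypothetical IDs: [AUDIO] placeholders plus <transcribe>, then target text tokens
prompt = [32000, 32000, 32000, 32001]
target = [517, 89, 1024]

input_ids, labels = build_labels(prompt, target)
assert len(input_ids) == len(labels)
assert labels[: len(prompt)] == [IGNORE_INDEX] * len(prompt)
```

Because the prompt positions carry `-100`, the cross-entropy loss is driven only by the transcription tokens.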
#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate transcription autoregressively
4. **Loss Calculation**: Compare generated vs target text
#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning rate scheduling
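The LoRA idea can be illustrated with plain NumPy: instead of updating the full weight matrix `W`, only two small matrices `A` and `B` are trained, and the effective weight is `W + B @ A` (the sizes below are illustrative, not Voxtral's actual dimensions):

```python
import numpy as np

d, r = 1024, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

W_eff = W + B @ A                    # effective weight used in the forward pass

# Trainable parameters shrink from d*d to 2*d*r
full_params = d * d
lora_params = A.size + B.size
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {lora_params/full_params:.2%}")
```

Zero-initializing `B` means training starts exactly from the base model's behavior, which is why LoRA fine-tuning is stable from step one.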
### Training Outputs
#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters
#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints
#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model
### Publishing Pipeline
#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```
#### Model Card Generation
- **Template Processing**: Fill model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.
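The variable injection step can be sketched with `string.Template`; the placeholder names below are hypothetical, not the actual variables defined by model_card.md:

```python
from string import Template

# Hypothetical placeholders; the real model_card.md template defines its own.
CARD_TEMPLATE = Template(
    "# $model_name\n\n"
    "Fine-tuned on $num_samples samples.\n"
    "Final training loss: $train_loss\n"
)

card = CARD_TEMPLATE.substitute(
    model_name="voxtral-finetune-demo",
    num_samples=150,
    train_loss=0.42,
)
print(card)
```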
### Demo Deployment
#### Space Repository Structure
```
username/model-name-demo/
├── app.py             # Gradio demo application
├── requirements.txt   # Python dependencies
├── README.md          # Space documentation
└── .env               # Environment variables
```
#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token  # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```
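Inside the Space's app.py, these variables are typically read via `os.environ` with local fallbacks (a sketch; only a few of the variables above are shown):

```python
import os

# Defaults keep the demo runnable locally without Space secrets configured.
model_id = os.environ.get("HF_MODEL_ID", "username/model-name")
model_name = os.environ.get("MODEL_NAME", "Voxtral Fine-tuned Model")
hf_token = os.environ.get("HF_TOKEN")  # None locally; set as a Space secret

print(f"Serving {model_name} from {model_id}")
```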
### Data Flow Patterns
#### Streaming vs Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency
- **User Input**: Real-time processing with immediate feedback
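The streaming pattern can be sketched with a generator that yields one JSONL record at a time instead of materializing the whole file in memory:

```python
import io
import json

def iter_jsonl(fp):
    """Yield records lazily so arbitrarily large files fit in constant memory."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulate a dataset file in memory
raw = io.StringIO(
    '{"audio_path": "wavs/recording_0000.wav", "text": "hello"}\n'
    '{"audio_path": "wavs/recording_0001.wav", "text": "world"}\n'
)

records = list(iter_jsonl(raw))
assert len(records) == 2
```

The same shape underlies `load_dataset(..., streaming=True)`: downstream code consumes an iterator rather than an in-memory list.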
#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment
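A minimal validation pass over JSONL entries might look like this (thresholds and checks are illustrative):

```python
def validate_entry(entry, min_chars=1, max_chars=2000):
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = []
    text = entry.get("text", "").strip()
    if not (min_chars <= len(text) <= max_chars):
        problems.append("text length out of range")
    path = entry.get("audio_path", "")
    if not path.lower().endswith((".wav", ".flac")):
        problems.append("unsupported audio format")
    return problems

assert validate_entry({"audio_path": "a.wav", "text": "hi"}) == []
assert "text length out of range" in validate_entry({"audio_path": "a.wav", "text": ""})
```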
#### Error Handling
- **Graceful Degradation**: Fallback to local data if external sources fail
- **Retry Logic**: Automatic retry for network failures
- **Logging**: Comprehensive error logging and debugging
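Retry logic for transient network failures is commonly implemented with exponential backoff; a generic sketch (not the application's actual helper):

```python
import time

def with_retries(fn, attempts=3, base_delay=0.1):
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

assert with_retries(flaky) == "ok"
assert calls["n"] == 3
```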
### Performance Considerations
#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory
#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing
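Deduplication can be done by hashing each record's canonical JSON form and keeping the first occurrence (a sketch over the `{audio_path, text}` schema used above):

```python
import hashlib
import json

def dedupe(entries):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen, unique = set(), []
    for entry in entries:
        key = hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique

data = [
    {"audio_path": "a.wav", "text": "hello"},
    {"audio_path": "a.wav", "text": "hello"},  # exact duplicate
    {"audio_path": "b.wav", "text": "world"},
]
assert len(dedupe(data)) == 2
```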
#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data
### Security & Privacy
#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information
#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces
### Monitoring & Analytics
#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage
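Simple dataset statistics can be computed directly from the JSONL records; a sketch summarizing transcript-length distribution:

```python
def dataset_stats(entries):
    """Summarize dataset size and transcript-length distribution."""
    lengths = sorted(len(e["text"]) for e in entries)
    n = len(lengths)
    return {
        "count": n,
        "min_chars": lengths[0] if n else 0,
        "max_chars": lengths[-1] if n else 0,
        "median_chars": lengths[n // 2] if n else 0,
    }

stats = dataset_stats([
    {"text": "hi"},
    {"text": "hello there"},
    {"text": "a longer transcription"},
])
assert stats["count"] == 3
```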
#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization
#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions
See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)