# Data Flow

```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC["🎤 Microphone Recording<br/>Raw audio + timestamps"]
        FILE["📁 File Upload<br/>WAV/FLAC files"]
        TEXT["📝 Manual Transcripts<br/>Text input"]
        LANG["🌍 Language Selection<br/>25+ languages"]
    end

    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC["Audio Processing<br/>Resampling to 16kHz<br/>Format conversion"]
        TEXT_PROC["Text Processing<br/>Transcript validation<br/>Cleaning & formatting"]
        JSONL_CONV["JSONL Conversion<br/>{#quot;audio_path#quot;: ..., #quot;text#quot;: ...}"]
    end

    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
        HF_DS["HF Hub Dataset<br/>username/dataset-name<br/>Public sharing"]
    end

    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT["Train Split<br/>train_dataset"]
        EVAL_SPLIT["Eval Split<br/>eval_dataset"]
    end

    %% Model Training
    subgraph "Model Training"
        COLLATOR["VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction"]
        FORWARD["Forward Pass<br/>Audio → Features → Text"]
        LOSS["Loss Calculation<br/>Masked LM loss"]
        BACKWARD["Backward Pass<br/>Gradient computation"]
        OPTIMIZE["Parameter Updates<br/>LoRA or full fine-tuning"]
    end

    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES["Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json"]
        TRAINING_LOGS["Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves"]
        CHECKPOINTS["Checkpoints<br/>Intermediate models<br/>best model tracking"]
    end

    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO["HF Repository<br/>username/model-name<br/>Model hosting"]
        MODEL_CARD["Model Card<br/>README.md<br/>Training details<br/>Usage examples"]
        METADATA["Training Metadata<br/>Config + results<br/>Performance metrics"]
    end

    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO["HF Space Repository<br/>username/model-name-demo<br/>Demo hosting"]
        DEMO_APP["Demo Application<br/>Gradio interface<br/>Real-time inference"]
        ENV_VARS["Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets"]
    end

    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY["NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages"]
        HF_COMM["HF Community Datasets<br/>Public ASR datasets<br/>Standard formats"]
    end

    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC
    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV
    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS
    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER
    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT
    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR
    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE
    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS
    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO
    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD
    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO
    SPACE_REPO --> DEMO_APP

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px

    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```
## Data Flow Overview

This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.
### Data Input Sources

#### User-Generated Data

- **Microphone Recording**: Raw audio captured through the browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary

#### External Data Sources

- **NVIDIA Granary**: High-quality multilingual ASR data covering 25+ languages
- **HF Community Datasets**: Public datasets from the Hugging Face Hub
### Data Processing Pipeline

#### Audio Processing

```python
import librosa
import soundfile as sf

# Resample to 16 kHz; librosa.load returns (samples, sampling_rate)
audio_data, sr = librosa.load(audio_path, sr=16000)
# Convert to WAV format for consistency
sf.write(output_path, audio_data, 16000)
```
#### Text Processing

```python
# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks)
assert len(text) > 0, "Empty transcription"
```
#### JSONL Conversion

```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}

# Append to the JSONL file (one JSON object per line)
with open(jsonl_path, "a") as f:
    f.write(json.dumps(entry) + "\n")
```
### Dataset Storage

#### Local Storage Structure

```
datasets/voxtral_user/
├── data.jsonl            # Main dataset file
├── recorded_data.jsonl   # From recordings
└── wavs/                 # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```
#### HF Hub Storage

- **Public Datasets**: Shareable with the community (see the upload sketch below)
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation
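Pushing the prepared dataset to the Hub is a thin wrapper over the `datasets` library. A minimal sketch, assuming the `data.jsonl`/`wavs/` layout shown above; the repo id is a placeholder, and the app's actual upload path may differ:

```python
from datasets import load_dataset, Audio

repo_id = "username/dataset-name"  # hypothetical repo id; use your own

# Load the local JSONL, decode audio from the stored paths, and push to the Hub
ds = load_dataset("json", data_files="datasets/voxtral_user/data.jsonl", split="train")
ds = ds.cast_column("audio_path", Audio(sampling_rate=16000))
ds.push_to_hub(repo_id)  # requires a write-scoped HF token (huggingface-cli login)
```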
### Training Data Pipeline

#### Dataset Loading

```python
# Load local JSONL
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")

# Load HF dataset
ds = load_dataset("username/dataset-name", split="train")
```
#### Audio Casting

```python
from datasets import Audio

# Ensure a consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```
#### Train/Eval Split

```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```
### Training Process Flow

#### Data Collation

- **VoxtralDataCollator**: Custom collator for the Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO]` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training (see the sketch below)
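The masking step is the easiest part to get wrong, so here is a minimal sketch of the idea: prompt (audio placeholder) positions are set to `-100` in `labels` so the loss ignores them, and only the transcription tokens are supervised. The function name and arguments are illustrative, not the collator's actual fields:

```python
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int, pad_token_id: int) -> torch.Tensor:
    """Supervise only the transcription part of each sequence."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100          # ignore audio/prompt tokens in the loss
    labels[labels == pad_token_id] = -100  # ignore padding as well
    return labels
```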
#### Forward Pass

1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Generate the transcription autoregressively
4. **Loss Calculation**: Compare predicted vs. target tokens (see the example below)
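Step 4 is the standard masked cross-entropy used for causal LMs: logits are shifted against the labels, and `-100` positions drop out of the average. A self-contained illustration with arbitrary shapes:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab)          # model output
labels = torch.randint(0, vocab, (batch, seq_len))   # target tokens
labels[:, :3] = -100                                 # masked prompt positions

# Shift so that position t predicts token t+1, then average over unmasked tokens
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```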
#### Backward Pass & Optimization

- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode, sketched below)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning-rate scheduling
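In LoRA mode, the trainable-parameter selection is typically expressed as a `peft` config wrapped around the base model. A hedged sketch; the rank, alpha, and target modules below are common defaults, not necessarily the values this project uses:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                  # adapter rank (assumed value)
    lora_alpha=32,         # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```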
### Training Outputs

#### Model Files

- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters

#### Training Logs

- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints

#### Checkpoints

- **checkpoint-XXX/**: Intermediate model snapshots (see the configuration sketch below)
- **best-model/**: Best-performing model
- **final-model/**: Final trained model
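Checkpointing and best-model tracking are usually driven by `transformers` `TrainingArguments`. A sketch of the relevant knobs; the values and output directory are illustrative, not the project's actual configuration:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs/voxtral-finetune",  # checkpoint-XXX/ directories land here
    save_strategy="steps",
    save_steps=500,                  # snapshot interval
    save_total_limit=3,              # keep only the most recent checkpoints
    eval_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,     # restore the best checkpoint after training
    metric_for_best_model="eval_loss",
)
```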
### Publishing Pipeline

#### HF Repository Structure

```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md                # model card
└── training_results/
    └── training.log
```
#### Model Card Generation

- **Template Processing**: Fill the model_card.md template (see the sketch below)
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.
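Template filling of this kind can be as simple as Python string substitution. A minimal sketch, assuming a `model_card.md` template with `$`-style placeholders; the placeholder names and the actual templating mechanism in the app are assumptions:

```python
import json
from string import Template

# Load training metadata produced by the run
with open("train_results.json") as f:
    results = json.load(f)

with open("model_card.md") as f:
    template = Template(f.read())

card = template.safe_substitute(
    model_name="username/model-name",            # hypothetical placeholder
    train_loss=results.get("train_loss", "n/a"),
)

with open("README.md", "w") as f:
    f.write(card)
```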
### Demo Deployment

#### Space Repository Structure

```
username/model-name-demo/
├── app.py              # Gradio demo application
├── requirements.txt    # Python dependencies
├── README.md           # Space documentation
└── .env                # Environment variables
```
#### Environment Configuration

```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token   # for model access
BRAND_OWNER_NAME=username
# ... other branding variables
```
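The Space's `app.py` reads these variables and serves inference. The following is a hedged sketch of such a Gradio app using a generic `transformers` ASR pipeline; the project's actual demo may load Voxtral through model-specific classes instead:

```python
import os
import gradio as gr
from transformers import pipeline

MODEL_ID = os.environ.get("HF_MODEL_ID", "username/model-name")  # set by the Space
MODEL_NAME = os.environ.get("MODEL_NAME", "Voxtral Fine-tuned Model")

# Generic ASR pipeline; a Voxtral model may require its dedicated classes
asr = pipeline("automatic-speech-recognition", model=MODEL_ID)

def transcribe(audio_path: str) -> str:
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title=MODEL_NAME,
)

if __name__ == "__main__":
    demo.launch()
```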
### Data Flow Patterns

#### Streaming vs Batch Processing

- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency (see the sketch below)
- **User Input**: Real-time processing with immediate feedback
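Streaming keeps large external corpora out of memory by iterating over examples instead of downloading the full dataset. A sketch with the `datasets` streaming API; the dataset id is a placeholder:

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset; nothing is fully downloaded up front
ds = load_dataset("username/dataset-name", split="train", streaming=True)

for example in ds.take(100):  # inspect the first 100 examples lazily
    print(example["text"])
```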
#### Data Validation

- **Input Validation**: Check audio format, sampling rate, text length (see the sketch below)
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment
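These checks can live in a single predicate applied before training. A minimal sketch of the idea; the thresholds are illustrative, not the app's actual limits:

```python
import soundfile as sf

def is_valid_entry(audio_path: str, text: str) -> bool:
    """Reject entries that would degrade training."""
    if not text or not text.strip():
        return False              # empty transcription
    info = sf.info(audio_path)
    if info.samplerate != 16000:
        return False              # unexpected sampling rate
    if info.duration > 30.0:
        return False              # assumed maximum clip length of 30 s
    return True
```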
#### Error Handling

- **Graceful Degradation**: Fall back to local data if external sources fail
- **Retry Logic**: Automatic retry for network failures
- **Logging**: Comprehensive error logging and debugging

### Performance Considerations

#### Memory Management

- **Streaming Loading**: Process large datasets without loading everything at once
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size against available memory

#### Storage Optimization

- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing

#### Network Efficiency

- **Incremental Uploads**: Upload files as they become ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data

### Security & Privacy

#### Data Privacy

- **Local Processing**: Audio files are processed locally when possible
- **User Consent**: Clear data-usage policies
- **Anonymization**: Remove personally identifiable information

#### Access Control

- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces

### Monitoring & Analytics

#### Data Quality Metrics

- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage

#### Performance Metrics

- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization

#### User Analytics

- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:

- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)