Spaces:

Tonic
/

VoxFactory

Running

File size: 8,659 Bytes

a3a3978

# Training Pipeline

```mermaid
graph TB
    %% Input Data Sources
    subgraph "Data Sources"
        JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
        GRANARY[NVIDIA Granary Dataset<br/>Multilingual ASR Data]
        HFDATA[HF Hub Datasets<br/>Community Datasets]
    end

    %% Data Processing
    subgraph "Data Processing"
        LOADER[Dataset Loader<br/>_load_jsonl_dataset()]
        CASTER[Audio Casting<br/>16kHz resampling]
        COLLATOR[VoxtralDataCollator<br/>Audio + Text Processing]
    end

    %% Training Scripts
    subgraph "Training Scripts"
        TRAIN_FULL[Full Fine-tuning<br/>scripts/train.py]
        TRAIN_LORA[LoRA Fine-tuning<br/>scripts/train_lora.py]

        subgraph "Training Components"
            MODEL_INIT[Model Initialization<br/>VoxtralForConditionalGeneration]
            LORA_CONFIG[LoRA Configuration<br/>LoraConfig + get_peft_model]
            PROCESSOR_INIT[Processor Initialization<br/>VoxtralProcessor]
        end
    end

    %% Training Infrastructure
    subgraph "Training Infrastructure"
        TRACKIO_INIT[Trackio Integration<br/>Experiment Tracking]
        HF_TRAINER[Hugging Face Trainer<br/>TrainingArguments + Trainer]
        TORCH_DEVICE[Torch Device Setup<br/>GPU/CPU Detection]
    end

    %% Training Process
    subgraph "Training Process"
        FORWARD_PASS[Forward Pass<br/>Audio Processing + Generation]
        LOSS_CALC[Loss Calculation<br/>Masked Language Modeling]
        BACKWARD_PASS[Backward Pass<br/>Gradient Computation]
        OPTIMIZER_STEP[Optimizer Step<br/>Parameter Updates]
        LOGGING[Metrics Logging<br/>Loss, Perplexity, etc.]
    end

    %% Model Management
    subgraph "Model Management"
        CHECKPOINT_SAVING[Checkpoint Saving<br/>Model snapshots]
        MODEL_SAVING[Final Model Saving<br/>Processor + Model]
        LOCAL_STORAGE[Local Storage<br/>outputs/ directory]
    end

    %% Flow Connections
    JSONL --> LOADER
    GRANARY --> LOADER
    HFDATA --> LOADER

    LOADER --> CASTER
    CASTER --> COLLATOR

    COLLATOR --> TRAIN_FULL
    COLLATOR --> TRAIN_LORA

    TRAIN_FULL --> MODEL_INIT
    TRAIN_LORA --> MODEL_INIT
    TRAIN_LORA --> LORA_CONFIG

    MODEL_INIT --> PROCESSOR_INIT
    LORA_CONFIG --> PROCESSOR_INIT

    PROCESSOR_INIT --> TRACKIO_INIT
    PROCESSOR_INIT --> HF_TRAINER
    PROCESSOR_INIT --> TORCH_DEVICE

    TRACKIO_INIT --> HF_TRAINER
    TORCH_DEVICE --> HF_TRAINER

    HF_TRAINER --> FORWARD_PASS
    FORWARD_PASS --> LOSS_CALC
    LOSS_CALC --> BACKWARD_PASS
    BACKWARD_PASS --> OPTIMIZER_STEP
    OPTIMIZER_STEP --> LOGGING

    LOGGING --> CHECKPOINT_SAVING
    LOGGING --> TRACKIO_INIT

    HF_TRAINER --> MODEL_SAVING
    MODEL_SAVING --> LOCAL_STORAGE

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef infrastructure fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef execution fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef output fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class JSONL,GRANARY,HFDATA input
    class LOADER,CASTER,COLLATOR processing
    class TRAIN_FULL,TRAIN_LORA,MODEL_INIT,LORA_CONFIG,PROCESSOR_INIT training
    class TRACKIO_INIT,HF_TRAINER,TORCH_DEVICE infrastructure
    class FORWARD_PASS,LOSS_CALC,BACKWARD_PASS,OPTIMIZER_STEP,LOGGING execution
    class CHECKPOINT_SAVING,MODEL_SAVING,LOCAL_STORAGE output
```

## Training Pipeline Overview

This diagram illustrates the complete training pipeline for Voxtral ASR fine-tuning, showing how data flows through the training scripts and supporting infrastructure.

### Data Input Sources

#### JSONL Datasets
- **Local Datasets**: User-created datasets from recordings or uploads
- **Format**: `{"audio_path": "path/to/audio.wav", "text": "transcription"}`
- **Processing**: Loaded via `_load_jsonl_dataset()` function

#### NVIDIA Granary Dataset
- **Multilingual Support**: 25+ European languages
- **High Quality**: Curated ASR training data
- **Streaming**: Efficient loading without full download

#### Hugging Face Hub Datasets
- **Community Datasets**: Public datasets from HF Hub
- **Standard Formats**: Compatible with Voxtral training requirements

### Data Processing Pipeline

#### Dataset Loading
```python
# Load local JSONL or HF dataset
ds = _load_jsonl_dataset(jsonl_path)
# or
ds = load_dataset(ds_name, ds_cfg, split="test")
```

#### Audio Processing
```python
# Cast to Audio format with 16kHz resampling
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```

#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral training
- **Audio Processing**: Converts audio to model inputs
- **Text Tokenization**: Processes transcription text
- **Masking**: Masks prompt tokens during training

### Training Script Architecture

#### Full Fine-tuning (`train.py`)
- **Complete Model Updates**: All parameters trainable
- **Higher Memory Requirements**: Full model in memory
- **Better Convergence**: Can achieve higher accuracy

#### LoRA Fine-tuning (`train_lora.py`)
- **Parameter Efficient**: Only LoRA adapters trained
- **Lower Memory Usage**: Base model frozen
- **Faster Training**: Fewer parameters to update
- **Configurable**: r, alpha, dropout parameters

### Training Infrastructure

#### Trackio Integration
```python
trackio.init(
    project="voxtral-finetuning",
    config={...},  # Training parameters
    space_id=trackio_space
)
```

#### Hugging Face Trainer
```python
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=batch_size,
    learning_rate=learning_rate,
    num_train_epochs=epochs,
    bf16=True,  # BFloat16 for efficiency
    report_to=["trackio"],
    # ... other args
)
```

#### Device Management
- **GPU Detection**: Automatic CUDA/GPU detection
- **Fallback**: CPU training if no GPU available
- **Memory Optimization**: Model sharding and gradient checkpointing

### Training Process Flow

#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Audio feature extraction
3. **Text Generation**: Autoregressive text generation from audio features

#### Loss Calculation
- **Masked Language Modeling**: Only transcription tokens contribute to loss
- **Audio Prompt Masking**: Audio processing tokens are masked out
- **Cross-Entropy Loss**: Standard language modeling loss

#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation through the model
- **LoRA Updates**: Only adapter parameters updated (LoRA mode)
- **Full Updates**: All parameters updated (full fine-tuning)

### Model Management

#### Checkpoint Saving
- **Regular Checkpoints**: Saved every N steps
- **Best Model Tracking**: Save best model based on validation loss
- **Resume Capability**: Continue training from checkpoints

#### Final Model Saving
```python
trainer.save_model()  # Saves model and tokenizer
processor.save_pretrained(output_dir)  # Saves processor
```

#### Local Storage Structure
```
outputs/
├── voxtral-finetuned-{timestamp}/
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   ├── training_config.json
│   ├── train_results.json
│   └── eval_results.json
```

### Integration Points

#### With Interface (`interface.py`)
- **Parameter Passing**: Training parameters from UI
- **Log Streaming**: Real-time training logs to UI
- **Progress Monitoring**: Training progress updates

#### With Model Publishing (`push_to_huggingface.py`)
- **Model Upload**: Trained model to HF Hub
- **Metadata**: Training config and results
- **Model Cards**: Automatic model card generation

#### With Demo Deployment (`deploy_demo_space.py`)
- **Space Creation**: HF Spaces for demos
- **Model Integration**: Deploy trained model in demo
- **Configuration**: Demo-specific settings

### Performance Considerations

#### Memory Optimization
- **LoRA**: Significantly reduces memory requirements
- **Gradient Checkpointing**: Trade compute for memory
- **Mixed Precision**: BF16/FP16 training

#### Training Efficiency
- **Batch Size**: Balanced with gradient accumulation
- **Learning Rate**: Warmup and decay schedules
- **Early Stopping**: Prevent overfitting

#### Monitoring & Debugging
- **Metrics Tracking**: Loss, perplexity, learning rate
- **GPU Utilization**: Memory and compute monitoring
- **Error Handling**: Graceful failure recovery

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Data Flow](data-flow.md)