# Voxtral ASR Fine-tuning Architecture

```mermaid
graph TB
    %% User Interface Layer
    subgraph "User Interface"
        UI[Gradio Web Interface<br/>interface.py]
        REC[Audio Recording<br/>Microphone Input]
        UP[File Upload<br/>WAV/FLAC files]
    end

    %% Data Processing Layer
    subgraph "Data Processing"
        DP[Data Processing<br/>Audio resampling<br/>JSONL creation]
        DS[Dataset Management<br/>NVIDIA Granary<br/>Local datasets]
    end

    %% Training Layer
    subgraph "Training Pipeline"
        TF[Full Fine-tuning<br/>scripts/train.py]
        TL[LoRA Fine-tuning<br/>scripts/train_lora.py]
        TI[Trackio Integration<br/>Experiment Tracking]
    end

    %% Model Management Layer
    subgraph "Model Management"
        MM[Model Management<br/>Hugging Face Hub<br/>Local storage]
        MC[Model Card Generation<br/>scripts/generate_model_card.py]
    end

    %% Deployment Layer
    subgraph "Deployment & Demo"
        DEP[Demo Space Deployment<br/>scripts/deploy_demo_space.py]
        HF[HF Spaces<br/>Interactive Demo]
    end

    %% External Services
    subgraph "External Services"
        HFH[Hugging Face Hub<br/>Models & Datasets]
        GRAN[NVIDIA Granary<br/>Multilingual ASR Dataset]
        TRACK[Trackio Spaces<br/>Experiment Tracking]
    end

    %% Data Flow
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS

    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI

    TF --> MM
    TL --> MM
    MM --> MC

    MM --> DEP
    DEP --> HF

    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    %% Styling
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
```

## Architecture Overview

This diagram shows the high-level architecture of the Voxtral ASR Fine-tuning application. The system is organized into five layers, plus the external services they depend on:

### 1. User Interface Layer
- **Gradio Web Interface**: Main user-facing application built with Gradio
- **Audio Recording**: Microphone input for recording speech samples
- **File Upload**: Support for uploading existing WAV/FLAC audio files

### 2. Data Processing Layer
- **Data Processing**: Audio resampling to 16 kHz and JSONL dataset creation
- **Dataset Management**: Integration with NVIDIA Granary dataset and local dataset handling
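
As a rough illustration of the JSONL step, the sketch below writes one JSON object per line and reads the file back. The field names `audio_path` and `text`, the file paths, and the helper functions are assumptions made for this example; the application's actual schema may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of rows written."""
    with Path(path).open("w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps non-English transcripts readable on disk
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical records: the keys and paths are illustrative only.
records = [
    {"audio_path": "recordings/sample_0001.wav", "text": "hello world"},
    {"audio_path": "recordings/sample_0002.flac", "text": "guten Morgen"},
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "data.jsonl"
    n_written = write_jsonl(records, out)
    loaded = read_jsonl(out)
```

Keeping one record per line means a dataset can be appended to as new recordings arrive, without rewriting the whole file.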

### 3. Training Layer
- **Full Fine-tuning**: Complete model fine-tuning using `scripts/train.py`
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning using `scripts/train_lora.py`
- **Trackio Integration**: Experiment tracking and logging
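
To make "parameter-efficient" concrete: LoRA freezes a weight matrix of shape `d_out × d_in` and trains two low-rank factors of rank `r`, which adds `r * (d_in + d_out)` trainable parameters. The sketch below compares that with the full matrix. The dimensions and rank are illustrative values chosen for the arithmetic, not Voxtral's actual layer sizes or the settings used by `scripts/train_lora.py`.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA update (factors A and B)."""
    # A has shape (rank, d_in) and B has shape (d_out, rank)
    return rank * d_in + d_out * rank


# Illustrative sizes only (assumed, not taken from the model config).
d_in, d_out, rank = 4096, 4096, 8

full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.4%}")
```

For a single square layer at these sizes, the LoRA update trains well under 1% of the parameters that full fine-tuning would.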

### 4. Model Management Layer
- **Model Management**: Local storage and Hugging Face Hub integration
- **Model Card Generation**: Automated model card creation
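
A model card on the Hugging Face Hub is a `README.md` that begins with a YAML metadata block. The following is a minimal sketch of generating one; the function `build_model_card`, the repository names, and the chosen fields are hypothetical, and it is not a description of what `scripts/generate_model_card.py` actually does.

```python
def build_model_card(model_name, base_model, languages, tags):
    """Return README.md text: YAML front matter followed by a short body."""
    lines = ["---", f"base_model: {base_model}", "language:"]
    lines += [f"- {lang}" for lang in languages]
    lines.append("tags:")
    lines += [f"- {tag}" for tag in tags]
    lines += [
        "---",
        "",
        f"# {model_name}",
        "",
        f"Fine-tuned from `{base_model}` for automatic speech recognition.",
        "",
    ]
    return "\n".join(lines)


# All identifiers below are placeholders for illustration.
card = build_model_card(
    model_name="my-voxtral-finetune",
    base_model="example-org/base-asr-model",
    languages=["en", "de"],
    tags=["automatic-speech-recognition", "lora"],
)
print(card)
```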

### 5. Deployment Layer
- **Demo Space Deployment**: Automated deployment to Hugging Face Spaces
- **Interactive Demo**: Live demo interface for testing fine-tuned models

### 6. External Services
- **Hugging Face Hub**: Model and dataset storage and sharing
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **Trackio Spaces**: Experiment tracking and visualization

## Key Workflows

1. **Dataset Creation**: Users record audio or upload files, which are processed into JSONL format
2. **Model Training**: Datasets are fed into the full or LoRA training script, with experiment tracking via Trackio
3. **Model Publishing**: Trained models are pushed to the Hugging Face Hub with generated model cards
4. **Demo Deployment**: Interactive demos are deployed automatically to Hugging Face Spaces
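
The ordering of these workflows follows from the diagram's solid arrows. As a small sketch, the edges can be written as a list and sorted topologically with the standard-library `graphlib` module (Python 3.9+). The node IDs are the ones used in the diagram; the dotted edges to external services are left out here for simplicity.

```python
from graphlib import TopologicalSorter

# Solid-arrow edges from the architecture diagram, as (source, target) pairs.
EDGES = [
    ("UI", "DP"), ("REC", "DP"), ("UP", "DP"), ("DP", "DS"),
    ("DS", "TF"), ("DS", "TL"), ("TF", "TI"), ("TL", "TI"),
    ("TF", "MM"), ("TL", "MM"), ("MM", "MC"),
    ("MM", "DEP"), ("DEP", "HF"),
]

# TopologicalSorter expects a mapping of node -> set of predecessors.
predecessors = {}
for src, dst in EDGES:
    predecessors.setdefault(src, set())
    predecessors.setdefault(dst, set()).add(src)

# static_order() raises graphlib.CycleError if the graph has a cycle.
order = list(TopologicalSorter(predecessors).static_order())
print(" -> ".join(order))
```

Any order it prints places data processing before training, training before model management, and model management before deployment, which matches the four workflows listed above.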

See also:
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)