# Interface Workflow

```mermaid
stateDiagram-v2
    [*] --> LangSetup: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files

        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts

        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL

        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params

        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment

        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints

        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo

        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
    note right of PostTraining : Automated deployment<br/>pipeline
```

## Interface Workflow Overview

This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow guides users step by step through dataset creation, training configuration, training, and post-training deployment.

### Key Workflow Stages

#### 1. Language & Dataset Setup
- **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
- **Phrase Loading**: The system loads authentic, high-quality phrases in the selected language (sketched after this list)
- **Recording Interface**: Dynamic interface showing phrases with audio recording components
- **Progressive Disclosure**: Users can add more rows as needed (up to 100 recordings)
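
The phrase-loading step above can be sketched with the `datasets` library. This is a minimal sketch, assuming the Granary corpus is fetched from the Hugging Face Hub under the `nvidia/Granary` id with per-language configurations and a `text` transcript column; the exact dataset id, config names, and columns used by `interface.py` may differ.

```python
# Hedged sketch: stream the first N transcript phrases for a selected language.
# Assumptions (not verified against interface.py): dataset id "nvidia/Granary",
# a per-language config such as "de", and a "text" column with transcripts.
from datasets import load_dataset


def load_phrases(language: str, n: int = 10) -> list[str]:
    """Return the first n non-empty transcripts for the given language config."""
    ds = load_dataset("nvidia/Granary", language, split="train", streaming=True)
    phrases: list[str] = []
    for example in ds:
        text = (example.get("text") or "").strip()
        if text:
            phrases.append(text)
        if len(phrases) >= n:
            break
    return phrases


if __name__ == "__main__":
    print(load_phrases("de", n=10))
```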

#### 2. Dataset Creation
- **From Recordings**: Process microphone recordings into WAV files and JSONL dataset
- **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
- **JSONL Format**: Standard format with `audio_path` and `text` fields (example below)
- **Local Storage**: Datasets stored in `datasets/voxtral_user/` directory
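
As a concrete example of the JSONL format and local storage described above, the dataset file can be produced with a few lines of Python. The `datasets/voxtral_user/` directory comes from this documentation; the `train.jsonl` file name is illustrative.

```python
# Sketch of the JSONL creation step: one {"audio_path": ..., "text": ...} object
# per line. The output directory matches the docs; the file name is illustrative.
import json
from pathlib import Path


def write_jsonl(records: list[dict], out_dir: str = "datasets/voxtral_user") -> Path:
    """records: [{"audio_path": "datasets/voxtral_user/rec_000.wav", "text": "..."}]"""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    jsonl_file = out_path / "train.jsonl"
    with jsonl_file.open("w", encoding="utf-8") as f:
        for rec in records:
            line = {"audio_path": rec["audio_path"], "text": rec["text"]}
            f.write(json.dumps(line, ensure_ascii=False) + "\n")
    return jsonl_file
```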

#### 3. Training Configuration
- **Basic Settings**: Model selection, LoRA vs full fine-tuning, batch size
- **Advanced Settings**: Learning rate, epochs, gradient accumulation
- **LoRA Parameters**: r, alpha, dropout, and audio-tower freezing options (see the `LoraConfig` sketch below)
- **Repository Setup**: Model naming and Hugging Face Hub integration
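
For the LoRA parameters listed above, the configuration that `train_lora.py` assembles presumably resembles the following `peft` sketch. The rank, alpha, dropout, and `target_modules` values here are illustrative defaults, not the repository's actual settings.

```python
# Illustrative LoRA configuration with peft; r / alpha / dropout / target_modules
# are example values and may not match what train_lora.py actually uses.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,              # LoRA rank
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

# Freezing the audio tower (the optional setting above) would amount to something
# like: for p in model.audio_tower.parameters(): p.requires_grad = False
```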

#### 4. Training Process
- **Trackio Integration**: Automatic experiment tracking setup
- **Script Execution**: Calls appropriate training script (`train.py` or `train_lora.py`)
- **Log Streaming**: Real-time display of training progress and metrics (streaming sketch below)
- **Error Handling**: Graceful handling of training failures with retry options
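
Log streaming boils down to running the training script as a subprocess and yielding its output line by line into the UI. A minimal sketch follows; the actual command-line arguments of `train.py` / `train_lora.py` are omitted and would have to match the real scripts.

```python
# Sketch of real-time log streaming: launch the training script and yield each
# stdout line as it arrives (e.g. into a Gradio component). CLI args are omitted.
import subprocess
import sys


def stream_training(script: str = "train_lora.py", args: list[str] | None = None):
    cmd = [sys.executable, script, *(args or [])]
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so failures appear in the same stream
        text=True,
        bufsize=1,                 # line-buffered
    )
    for line in proc.stdout:
        yield line.rstrip()
    yield f"[training exited with code {proc.wait()}]"
```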

#### 5. Post-Training Actions
- **Model Publishing**: Automatic push to Hugging Face Hub (see the sketch below)
- **Model Card Generation**: Automated creation using `generate_model_card.py`
- **Demo Deployment**: One-click deployment of interactive demo spaces
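
Model publishing can be sketched with the `huggingface_hub` client. The repo id below is a placeholder, and the real `push_to_huggingface.py` may structure the upload differently (for example, also attaching the generated model card).

```python
# Hedged sketch of the model-publishing step using huggingface_hub.
# "username/voxtral-finetune" is a placeholder repo id.
from huggingface_hub import HfApi


def push_model(local_dir: str, repo_id: str = "username/voxtral-finetune") -> str:
    api = HfApi()  # authenticates via HF_TOKEN or a cached `huggingface-cli login`
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model")
    return f"https://huggingface.co/{repo_id}"
```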

### Alternative Paths

#### Dataset-Only Workflow
- Users can create and publish datasets without training models (sketched below)
- Useful for dataset curation and sharing
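
The dataset-only path can be sketched with the `datasets` library: load the local JSONL, attach the audio files, and push the result to the Hub. The repo id is a placeholder, paths assume the working directory is the project root, and the real logic lives in `push_to_huggingface.py`.

```python
# Sketch of the dataset-only workflow: publish the local JSONL + WAV files as a
# Hub dataset. Repo id and file name are placeholders.
from datasets import Audio, load_dataset

ds = load_dataset("json", data_files="datasets/voxtral_user/train.jsonl", split="train")
ds = ds.cast_column("audio_path", Audio())       # resolve file paths into audio data
ds.push_to_hub("username/voxtral-asr-dataset")   # placeholder repo id
```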

#### Error Recovery
- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information

### Technical Integration Points

#### External Services
- **NVIDIA Granary**: Source of high-quality multilingual ASR data
- **Hugging Face Hub**: Model and dataset storage and sharing
- **Trackio Spaces**: Experiment tracking and visualization
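
Trackio exposes a wandb-style API, so the metric logging behind the Trackio Spaces integration presumably looks roughly like the sketch below. The project name and metric keys are placeholders; in this repository the integration is wired into `train.py` / `train_lora.py` rather than written by hand.

```python
# Hedged sketch of Trackio experiment tracking via its wandb-style API.
# Project name and metric keys are placeholders.
import trackio

trackio.init(project="voxtral-asr-finetune")
for step in range(3):
    trackio.log({"train/loss": 1.0 / (step + 1), "step": step})  # dummy metrics
trackio.finish()
```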

#### Script Integration
- **interface.py**: Main Gradio application orchestrating the workflow
- **train.py/train_lora.py**: Core training scripts with Trackio integration
- **push_to_huggingface.py**: Model/dataset publishing
- **deploy_demo_space.py**: Automated demo deployment
- **generate_model_card.py**: Model documentation generation

### User Experience Features

#### Progressive Interface Reveal
- Interface components are revealed as users progress through the workflow (see the Gradio sketch below)
- Reduces cognitive load and guides users step-by-step
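
In Gradio terms, progressive reveal usually means creating later sections hidden and toggling their visibility from earlier callbacks. A small sketch, with component names that are illustrative rather than taken from `interface.py`:

```python
# Sketch of progressive interface reveal in Gradio: the recording section starts
# hidden and is revealed by an earlier step's callback. Names are illustrative.
import gradio as gr

with gr.Blocks() as demo:
    language = gr.Dropdown(["en", "de", "fr"], label="Language")
    load_btn = gr.Button("Load phrases")
    with gr.Column(visible=False) as recording_section:
        gr.Markdown("Record yourself reading each phrase below.")

    # Reveal the recording section once phrases have been loaded.
    load_btn.click(lambda: gr.update(visible=True), outputs=recording_section)

demo.launch()
```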

#### Real-time Feedback
- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation

#### Flexible Input Methods
- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)

See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)