# Interface Workflow
```mermaid
stateDiagram-v2
[*] --> LanguageSelection: User opens interface
state "Language & Dataset Setup" as LangSetup {
[*] --> LanguageSelection
LanguageSelection --> LoadPhrases: Select language
LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
DisplayPhrases --> RecordingInterface: Show phrases & recording UI
state RecordingInterface {
[*] --> ShowInitialRows: Display first 10 phrases
ShowInitialRows --> RecordAudio: User can record audio
RecordAudio --> AddMoreRows: Optional - add 10 more rows
AddMoreRows --> RecordAudio
}
}
RecordingInterface --> DatasetCreation: User finishes recording
state "Dataset Creation Options" as DatasetCreation {
[*] --> FromRecordings: Create from recorded audio
[*] --> FromUploads: Upload existing files
FromRecordings --> ProcessRecordings: Save WAV files + transcripts
FromUploads --> ProcessUploads: Process uploaded files + transcripts
ProcessRecordings --> CreateJSONL: Generate JSONL dataset
ProcessUploads --> CreateJSONL
CreateJSONL --> DatasetReady: Dataset saved locally
}
DatasetCreation --> TrainingConfiguration: Dataset ready
state "Training Setup" as TrainingConfiguration {
[*] --> BasicSettings: Model, LoRA/full, batch size
[*] --> AdvancedSettings: Learning rate, epochs, LoRA params
BasicSettings --> ConfigureDeployment: Repo name, push options
AdvancedSettings --> ConfigureDeployment
ConfigureDeployment --> StartTraining: All settings configured
}
TrainingConfiguration --> TrainingProcess: Start training
state "Training Process" as TrainingProcess {
[*] --> InitializeTrackio: Setup experiment tracking
InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
RunTrainingScript --> StreamLogs: Show real-time training logs
StreamLogs --> MonitorProgress: Track metrics & checkpoints
MonitorProgress --> TrainingComplete: Training finished
MonitorProgress --> HandleErrors: Training failed
HandleErrors --> RetryOrExit: User can retry or exit
}
TrainingProcess --> PostTraining: Training complete
state "Post-Training Actions" as PostTraining {
[*] --> PushToHub: Push model to HF Hub
[*] --> GenerateModelCard: Create model card
[*] --> DeployDemoSpace: Deploy interactive demo
PushToHub --> ModelPublished: Model available on HF Hub
GenerateModelCard --> ModelDocumented: Model card created
DeployDemoSpace --> DemoReady: Demo space deployed
}
PostTraining --> [*]: Process complete
%% Alternative paths
DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
%% Error handling
TrainingProcess --> ErrorRecovery: Handle training errors
ErrorRecovery --> RetryTraining: Retry with different settings
RetryTraining --> TrainingConfiguration
%% Styling and notes
note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
note right of PostTraining : Automated deployment<br/>pipeline
```
## Interface Workflow Overview
This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface, from language selection and recording through dataset creation, training, and publishing. Each stage leads into the next so that users are guided step by step through the fine-tuning process.
### Key Workflow Stages
#### 1. Language & Dataset Setup
- **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
- **Phrase Loading**: System loads authentic, high-quality phrases in the selected language
- **Recording Interface**: Dynamic interface showing phrases with audio recording components
- **Progressive Disclosure**: The interface starts with 10 phrases, and users can add 10 more rows at a time as needed (up to 100 recordings)
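
The row-reveal rule can be sketched as a small helper. This is a minimal illustration, assuming a step of 10 rows and a cap of 100 as described in this document; `next_visible_rows` is a hypothetical name and not a function taken from `interface.py`.

```python
def next_visible_rows(current: int, step: int = 10, max_rows: int = 100) -> int:
    """Return how many phrase rows to show after one "add more rows" action."""
    if current < 0 or step <= 0 or max_rows <= 0:
        raise ValueError("current must be >= 0; step and max_rows must be positive")
    # Never exceed the cap, even if the step would overshoot it.
    return min(current + step, max_rows)
```

Clamping with `min` keeps the helper safe to call repeatedly: once the cap is reached, further calls return the same value.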
#### 2. Dataset Creation
- **From Recordings**: Process microphone recordings into WAV files and JSONL dataset
- **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
- **JSONL Format**: Standard format with `audio_path` and `text` fields
- **Local Storage**: Datasets stored in `datasets/voxtral_user/` directory
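
As a rough sketch of the JSONL step, the function below writes one JSON object per line with the `audio_path` and `text` fields named above. The function name `write_jsonl`, the choice to skip rows with empty transcripts, and the file names in the usage note are assumptions made for illustration; the project's own implementation may differ.

```python
import json
from pathlib import Path


def write_jsonl(rows, out_path):
    """Write (audio_path, text) pairs as one JSON object per line.

    Returns the number of records written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for audio_path, text in rows:
            text = text.strip()
            if not text:
                # Assumption of this sketch: rows without a transcript are skipped.
                continue
            record = {"audio_path": str(audio_path), "text": text}
            # ensure_ascii=False keeps non-ASCII transcripts human-readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A call might look like `write_jsonl([("datasets/voxtral_user/0001.wav", "Bonjour")], "datasets/voxtral_user/data.jsonl")`, where both file names are placeholders.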
#### 3. Training Configuration
- **Basic Settings**: Model selection, LoRA vs full fine-tuning, batch size
- **Advanced Settings**: Learning rate, epochs, gradient accumulation
- **LoRA Parameters**: r, alpha, dropout, audio tower freezing options
- **Repository Setup**: Model naming and Hugging Face Hub integration
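
One way to picture how these settings fit together is a small configuration object. The sketch below is illustrative only: the class and field names are hypothetical, and the rule that LoRA runs use `train_lora.py` while full fine-tuning uses `train.py` is an assumption inferred from the script names rather than confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoraParams:
    r: int                    # LoRA rank
    alpha: int                # LoRA scaling factor
    dropout: float            # dropout applied to the LoRA layers
    freeze_audio_tower: bool  # whether the audio encoder stays frozen


@dataclass(frozen=True)
class TrainingConfig:
    model_name: str
    batch_size: int
    learning_rate: float
    epochs: int
    grad_accum_steps: int
    lora: Optional[LoraParams] = None  # None means full fine-tuning

    def validate(self) -> None:
        """Raise ValueError on settings that cannot produce a valid run."""
        if min(self.batch_size, self.epochs, self.grad_accum_steps) < 1:
            raise ValueError("batch_size, epochs and grad_accum_steps must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.lora is not None:
            if self.lora.r < 1:
                raise ValueError("LoRA r must be >= 1")
            if not 0.0 <= self.lora.dropout < 1.0:
                raise ValueError("LoRA dropout must be in [0, 1)")

    @property
    def script(self) -> str:
        # Assumed mapping: LoRA runs use train_lora.py, full runs use train.py.
        return "train_lora.py" if self.lora is not None else "train.py"

    @property
    def effective_batch_size(self) -> int:
        # Gradient accumulation multiplies the per-step batch size.
        return self.batch_size * self.grad_accum_steps
```

Validating before launch means a bad value surfaces as an immediate message in the interface instead of a failed training run.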
#### 4. Training Process
- **Trackio Integration**: Automatic experiment tracking setup
- **Script Execution**: Calls the appropriate training script (`train.py` or `train_lora.py`)
- **Log Streaming**: Real-time display of training progress and metrics
- **Error Handling**: Graceful handling of training failures with retry options
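
Script execution with log streaming can be approximated using only the standard library. The sketch below is a generic pattern under stated assumptions, not the code in `interface.py`: `stream_command` is a hypothetical helper that runs a command, yields its output line by line, and raises if the process exits with a non-zero status.

```python
import subprocess
import sys
from typing import Iterator, List


def stream_command(cmd: List[str]) -> Iterator[str]:
    """Run cmd and yield its combined stdout/stderr one line at a time."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors show up in the same stream
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    )
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        # Surface failures so the caller can offer a retry.
        raise subprocess.CalledProcessError(returncode, cmd)


if __name__ == "__main__":
    # Stand-in for a training script: a child process that prints two lines.
    for log_line in stream_command([sys.executable, "-c", "print('step 1'); print('step 2')"]):
        print(log_line)
```

A Gradio event handler written as a generator could yield the accumulated lines to update a log textbox as they arrive. When the child is a Python script, running it with `python -u` (or setting `PYTHONUNBUFFERED=1`) helps lines appear promptly, since Python buffers stdout when it is not attached to a terminal.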
#### 5. Post-Training Actions
- **Model Publishing**: Automatic push to Hugging Face Hub
- **Model Card Generation**: Automated creation using `generate_model_card.py`
- **Demo Deployment**: One-click deployment of interactive demo spaces
### Alternative Paths
#### Dataset-Only Workflow
- Users can create and publish datasets without training models
- Useful for dataset curation and sharing
#### Error Recovery
- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information
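
The retry flow can be sketched as a loop over candidate settings. In the interface the user decides whether and how to retry; this illustration automates that choice over a predefined list to show the control flow. `run_with_retries` and the `train` callable are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_with_retries(
    train: Callable[[Dict], None],
    attempts: Iterable[Dict],
) -> Tuple[Optional[Dict], List[str]]:
    """Try each settings dict in order until one training run succeeds.

    Returns the settings that worked (or None) plus a log of every failure.
    """
    errors: List[str] = []
    for settings in attempts:
        try:
            train(settings)
            return settings, errors
        except Exception as exc:
            # Record the failure so it can be shown to the user for debugging.
            errors.append(f"{settings}: {exc!r}")
    return None, errors
```

Keeping the error log separate from the result lets the interface show every failed attempt, even when a later attempt succeeds.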
### Technical Integration Points
#### External Services
- **NVIDIA Granary**: Source of high-quality multilingual ASR data
- **Hugging Face Hub**: Model and dataset storage and sharing
- **Trackio Spaces**: Experiment tracking and visualization
#### Script Integration
- **`interface.py`**: Main Gradio application orchestrating the workflow
- **`train.py` / `train_lora.py`**: Core training scripts with Trackio integration
- **`push_to_huggingface.py`**: Model/dataset publishing
- **`deploy_demo_space.py`**: Automated demo deployment
- **`generate_model_card.py`**: Model documentation generation
### User Experience Features
#### Progressive Interface Reveal
- Interface components are revealed as users progress through the workflow
- Reduces cognitive load and guides users step-by-step
#### Real-time Feedback
- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation
#### Flexible Input Methods
- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)
See also:
- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)