Interface Workflow
stateDiagram-v2
[*] --> LanguageSelection: User opens interface
state "Language & Dataset Setup" as LangSetup {
[*] --> LanguageSelection
LanguageSelection --> LoadPhrases: Select language
LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
DisplayPhrases --> RecordingInterface: Show phrases & recording UI
state RecordingInterface {
[*] --> ShowInitialRows: Display first 10 phrases
ShowInitialRows --> RecordAudio: User can record audio
RecordAudio --> AddMoreRows: Optional - add 10 more rows
AddMoreRows --> RecordAudio
}
}
RecordingInterface --> DatasetCreation: User finishes recording
state "Dataset Creation Options" as DatasetCreation {
[*] --> FromRecordings: Create from recorded audio
[*] --> FromUploads: Upload existing files
FromRecordings --> ProcessRecordings: Save WAV files + transcripts
FromUploads --> ProcessUploads: Process uploaded files + transcripts
ProcessRecordings --> CreateJSONL: Generate JSONL dataset
ProcessUploads --> CreateJSONL
CreateJSONL --> DatasetReady: Dataset saved locally
}
DatasetCreation --> TrainingConfiguration: Dataset ready
state "Training Setup" as TrainingConfiguration {
[*] --> BasicSettings: Model, LoRA/full, batch size
[*] --> AdvancedSettings: Learning rate, epochs, LoRA params
BasicSettings --> ConfigureDeployment: Repo name, push options
AdvancedSettings --> ConfigureDeployment
ConfigureDeployment --> StartTraining: All settings configured
}
TrainingConfiguration --> TrainingProcess: Start training
state "Training Process" as TrainingProcess {
[*] --> InitializeTrackio: Setup experiment tracking
InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
RunTrainingScript --> StreamLogs: Show real-time training logs
StreamLogs --> MonitorProgress: Track metrics & checkpoints
MonitorProgress --> TrainingComplete: Training finished
MonitorProgress --> HandleErrors: Training failed
HandleErrors --> RetryOrExit: User can retry or exit
}
TrainingProcess --> PostTraining: Training complete
state "Post-Training Actions" as PostTraining {
[*] --> PushToHub: Push model to HF Hub
[*] --> GenerateModelCard: Create model card
[*] --> DeployDemoSpace: Deploy interactive demo
PushToHub --> ModelPublished: Model available on HF Hub
GenerateModelCard --> ModelDocumented: Model card created
DeployDemoSpace --> DemoReady: Demo space deployed
}
PostTraining --> [*]: Process complete
%% Alternative paths
DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
PushDatasetOnly --> DatasetPublished: Dataset on HF Hub
%% Error handling
TrainingProcess --> ErrorRecovery: Handle training errors
ErrorRecovery --> RetryTraining: Retry with different settings
RetryTraining --> TrainingConfiguration
%% Styling and notes
note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
note right of PostTraining : Automated deployment<br/>pipeline
Interface Workflow Overview
This diagram traces the full user journey through the Voxtral ASR Fine-tuning interface, from language selection through dataset creation, training, and publishing. Each stage leads directly into the next, so users are guided through the fine-tuning process one step at a time.
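The top-level edges of the diagram can be read as a small lookup table. The sketch below is only an illustration of that reading, not code from interface.py: the state names follow the diagram, the event names are hypothetical labels chosen here, and the intermediate RetryTraining state is collapsed into a single retry edge.

```python
# Top-level stages taken from the diagram; event names are hypothetical labels.
TRANSITIONS = {
    ("RecordingInterface", "finish_recording"): "DatasetCreation",
    ("DatasetCreation", "dataset_ready"): "TrainingConfiguration",
    ("DatasetCreation", "push_dataset_only"): "PushDatasetOnly",
    ("TrainingConfiguration", "start_training"): "TrainingProcess",
    ("TrainingProcess", "training_complete"): "PostTraining",
    ("TrainingProcess", "training_failed"): "ErrorRecovery",
    # The diagram routes this edge through RetryTraining; collapsed here.
    ("ErrorRecovery", "retry"): "TrainingConfiguration",
}


def next_stage(stage: str, event: str) -> str:
    """Return the stage reached from `stage` on `event`.

    Raises ValueError when the table has no such edge.
    """
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise ValueError(f"no transition from {stage!r} on {event!r}") from None


def walk(start: str, events: list[str]) -> list[str]:
    """Apply events in order and return every stage visited, start included."""
    path = [start]
    for event in events:
        path.append(next_stage(path[-1], event))
    return path
```

Keeping the edges in one table makes the alternative paths (dataset-only publishing, retry after failure) as easy to check as the main path.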
Key Workflow Stages
1. Language & Dataset Setup
- Language Selection: Users choose from 25+ European languages supported by NVIDIA Granary
- Phrase Loading: System loads authentic, high-quality phrases in the selected language
- Recording Interface: Dynamic interface showing phrases with audio recording components
- Progressive Disclosure: Users can add more rows as needed (up to 100 recordings)
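The "add 10 more rows" step amounts to a counter clamped at the 100-recording limit. How interface.py implements this is not shown here; the sketch below assumes one common approach, in which all rows are created up front and only their visibility changes. The function and constant names are hypothetical.

```python
INITIAL_ROWS = 10  # rows shown when the recording UI first appears
STEP = 10          # rows revealed per "add more rows" action
MAX_ROWS = 100     # upper bound on recordings stated in the text


def rows_after_add(visible: int, step: int = STEP, max_rows: int = MAX_ROWS) -> int:
    """Return the number of visible rows after one 'add more rows' action."""
    return min(visible + step, max_rows)


def visibility_flags(visible: int, max_rows: int = MAX_ROWS) -> list[bool]:
    """One flag per pre-built row: True if that row should be shown."""
    return [i < visible for i in range(max_rows)]
```

Each flag would then drive the visibility of one phrase and recorder pair in the UI.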
2. Dataset Creation
- From Recordings: Process microphone recordings into WAV files and JSONL dataset
- From Uploads: Handle existing WAV/FLAC files with manual transcripts
- JSONL Format: Standard format with audio_path and text fields
- Local Storage: Datasets stored in the datasets/voxtral_user/ directory
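The JSONL layout described above is one JSON object per line with audio_path and text keys. The following is a minimal sketch of writing and reading such a file. The helper names are hypothetical, and skipping rows with an empty transcript is an assumption made here, not documented behavior of the interface.

```python
import json
from pathlib import Path


def write_jsonl(pairs, out_path):
    """Write (audio_path, text) pairs as one JSON object per line.

    Returns the number of rows written. Pairs whose transcript is empty
    after stripping whitespace are skipped (an assumption of this sketch).
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8") as f:
        for audio_path, text in pairs:
            text = text.strip()
            if not text:
                continue
            row = {"audio_path": str(audio_path), "text": text}
            # ensure_ascii=False keeps accented characters readable in the file
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written


def read_jsonl(path):
    """Load a JSONL file back into a list of dicts, ignoring blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Writing with ensure_ascii=False matters for a multilingual dataset, since transcripts in most European languages contain non-ASCII characters.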
3. Training Configuration
- Basic Settings: Model selection, LoRA vs full fine-tuning, batch size
- Advanced Settings: Learning rate, epochs, gradient accumulation
- LoRA Parameters: r, alpha, dropout, audio tower freezing options
- Repository Setup: Model naming and Hugging Face Hub integration
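The settings listed above can be gathered into a single object before training starts. This is a sketch under stated assumptions: every field name and default value below is illustrative and not taken from the interface. Only the two script names, train.py and train_lora.py, come from this document.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # All field names and defaults are illustrative assumptions.
    model_name: str
    use_lora: bool = True
    batch_size: int = 2
    learning_rate: float = 5e-5
    epochs: float = 3.0
    gradient_accumulation_steps: int = 4
    lora_r: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.0
    freeze_audio_tower: bool = True

    def validate(self) -> None:
        """Raise ValueError for settings that cannot describe a valid run."""
        if self.batch_size < 1 or self.gradient_accumulation_steps < 1:
            raise ValueError("batch size and accumulation steps must be >= 1")
        if self.learning_rate <= 0:
            raise ValueError("learning rate must be positive")
        if self.epochs <= 0:
            raise ValueError("epochs must be positive")
        if self.use_lora and self.lora_r < 1:
            raise ValueError("LoRA rank r must be >= 1")
        if self.use_lora and not 0.0 <= self.lora_dropout < 1.0:
            raise ValueError("LoRA dropout must be in [0, 1)")

    @property
    def effective_batch_size(self) -> int:
        """Per-device samples contributing to each optimizer step."""
        return self.batch_size * self.gradient_accumulation_steps


def training_script(config: TrainingConfig) -> str:
    """Pick the script named in this document for the chosen method."""
    return "train_lora.py" if config.use_lora else "train.py"
```

Validating before launch surfaces bad values in the UI instead of several minutes into a training run.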
4. Training Process
- Trackio Integration: Automatic experiment tracking setup
- Script Execution: Calls the appropriate training script (train.py or train_lora.py)
- Log Streaming: Real-time display of training progress and metrics
- Error Handling: Graceful handling of training failures with retry options
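Script execution with live logs can be sketched using the standard library alone. The generator below runs a command, yields each output line as it arrives, and raises if the process exits with a non-zero code. It is an assumption about how such streaming might be done, not the code in interface.py, and the demo command is a stand-in rather than a real training invocation.

```python
import subprocess
import sys
from typing import Iterator, List


def stream_logs(cmd: List[str]) -> Iterator[str]:
    """Run `cmd` and yield its output line by line as it is produced.

    stderr is merged into stdout so errors appear in the same stream.
    Raises RuntimeError if the process exits with a non-zero code.
    """
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the `with` block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise RuntimeError(f"training command exited with code {proc.returncode}")


if __name__ == "__main__":
    # Stand-in child process; a real run would invoke train.py or train_lora.py.
    demo = [sys.executable, "-u", "-c", "print('step 1'); print('step 2')"]
    for log_line in stream_logs(demo):
        print(log_line)
```

Because Gradio event handlers may be generator functions, a handler could accumulate these lines and yield the growing log text to a textbox. The child process should also run unbuffered (for Python, the -u flag) so that lines arrive promptly.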
5. Post-Training Actions
- Model Publishing: Automatic push to Hugging Face Hub
- Model Card Generation: Automated creation using generate_model_card.py
- Demo Deployment: One-click deployment of interactive demo spaces
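A model card on the Hugging Face Hub is a README.md that starts with a YAML metadata block. The sketch below renders a minimal card as a string. The layout and the function name are assumptions for illustration and do not reflect the actual output of generate_model_card.py. The metadata keys used (base_model, language, pipeline_tag, tags) are standard Hub model card fields.

```python
def render_model_card(
    repo_id: str,
    base_model: str,
    language: str,
    dataset_rows: int,
    use_lora: bool,
) -> str:
    """Return README.md text: YAML front matter followed by a short body."""
    method = "LoRA" if use_lora else "full fine-tuning"
    front_matter = "\n".join(
        [
            "---",
            f"base_model: {base_model}",
            f"language: {language}",
            "pipeline_tag: automatic-speech-recognition",
            "tags:",
            "- voxtral",
            "- asr",
            "---",
        ]
    )
    body = "\n".join(
        [
            f"# {repo_id}",
            "",
            f"Fine-tuned from {base_model} using {method}.",
            f"Training data: {dataset_rows} recorded or uploaded utterances.",
        ]
    )
    return front_matter + "\n\n" + body + "\n"
```

The returned text would be saved as README.md in the model repository alongside the weights.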
Alternative Paths
Dataset-Only Workflow
- Users can create and publish datasets without training models
- Useful for dataset curation and sharing
Error Recovery
- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information
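The retry flow can be outlined as a loop that records each failure and tries again with changed settings. In the interface the user supplies the modified parameters; in this sketch a hypothetical adjust callback stands in for that step. None of these names come from the project's scripts.

```python
from typing import Callable, Dict, List, Tuple

Settings = Dict[str, int]


def run_with_retries(
    train: Callable[[Settings], None],
    settings: Settings,
    adjust: Callable[[Settings], Settings],
    max_attempts: int = 3,
) -> Tuple[Settings, List[str]]:
    """Call train(settings); on failure, log the error, adjust, and retry.

    Returns the settings that succeeded and the log of earlier errors.
    Raises RuntimeError once max_attempts have all failed.
    """
    errors: List[str] = []
    current = dict(settings)
    for attempt in range(1, max_attempts + 1):
        try:
            train(current)
            return current, errors
        except Exception as exc:
            errors.append(f"attempt {attempt}: {type(exc).__name__}: {exc}")
            current = adjust(current)
    raise RuntimeError("training failed after retries:\n" + "\n".join(errors))
```

For example, an adjust function that halves the batch size after each failure would mimic a user responding to an out-of-memory error.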
Technical Integration Points
External Services
- NVIDIA Granary: Source of high-quality multilingual ASR data
- Hugging Face Hub: Model and dataset storage and sharing
- Trackio Spaces: Experiment tracking and visualization
Script Integration
- interface.py: Main Gradio application orchestrating the workflow
- train.py/train_lora.py: Core training scripts with Trackio integration
- push_to_huggingface.py: Model/dataset publishing
- deploy_demo_space.py: Automated demo deployment
- generate_model_card.py: Model documentation generation
User Experience Features
Progressive Interface Reveal
- Interface components are revealed as users progress through the workflow
- Reduces cognitive load and guides users step-by-step
Real-time Feedback
- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation
Flexible Input Methods
- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)