# Interface Workflow

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files
        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts
        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL
        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params
        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment
        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints
        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo
        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection
        User selects language for
        authentic phrases from
        NVIDIA Granary dataset
    end note

    note right of RecordingInterface
        Users record themselves
        reading displayed phrases
    end note

    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}

    note right of TrainingConfiguration
        Configure LoRA parameters,
        learning rate, epochs, etc.
    end note

    note right of TrainingProcess
        Real-time log streaming
        with Trackio integration
    end note

    note right of PostTraining
        Automated deployment
        pipeline
    end note
```

## Interface Workflow Overview

This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow is designed to be intuitive and to guide users through each step of the fine-tuning process.

### Key Workflow Stages

#### 1. Language & Dataset Setup

- **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary
- **Phrase Loading**: System loads authentic, high-quality phrases in the selected language
- **Recording Interface**: Dynamic interface showing phrases with audio recording components
- **Progressive Disclosure**: Users can add more rows as needed (up to 100 recordings)

#### 2. Dataset Creation

- **From Recordings**: Process microphone recordings into WAV files and a JSONL dataset
- **From Uploads**: Handle existing WAV/FLAC files with manual transcripts
- **JSONL Format**: Standard format with `audio_path` and `text` fields
- **Local Storage**: Datasets stored in the `datasets/voxtral_user/` directory

#### 3. Training Configuration

- **Basic Settings**: Model selection, LoRA vs full fine-tuning, batch size
- **Advanced Settings**: Learning rate, epochs, gradient accumulation
- **LoRA Parameters**: r, alpha, dropout, audio tower freezing options
- **Repository Setup**: Model naming and Hugging Face Hub integration

#### 4. Training Process

- **Trackio Integration**: Automatic experiment tracking setup
- **Script Execution**: Calls the appropriate training script (`train.py` or `train_lora.py`)
- **Log Streaming**: Real-time display of training progress and metrics
- **Error Handling**: Graceful handling of training failures with retry options

#### 5. Post-Training Actions

- **Model Publishing**: Automatic push to Hugging Face Hub
- **Model Card Generation**: Automated creation using `generate_model_card.py`
- **Demo Deployment**: One-click deployment of interactive demo spaces

### Alternative Paths

#### Dataset-Only Workflow

- Users can create and publish datasets without training models
- Useful for dataset curation and sharing

#### Error Recovery

- Training failures trigger error recovery flows
- Users can retry with modified parameters
- Comprehensive error logging and debugging information

### Technical Integration Points

#### External Services

- **NVIDIA Granary**: Source of high-quality multilingual ASR data
- **Hugging Face Hub**: Model and dataset storage and sharing
- **Trackio Spaces**: Experiment tracking and visualization

#### Script Integration

- **interface.py**: Main Gradio application orchestrating the workflow
- **train.py/train_lora.py**: Core training scripts with Trackio integration
- **push_to_huggingface.py**: Model/dataset publishing
- **deploy_demo_space.py**: Automated demo deployment
- **generate_model_card.py**: Model documentation generation

### User Experience Features

#### Progressive Interface Reveal

- Interface components are revealed as users progress through the workflow
- Reduces cognitive load and guides users step-by-step

#### Real-time Feedback

- Live log streaming during training
- Progress indicators and status updates
- Immediate feedback on dataset creation and validation

#### Flexible Input Methods

- Support for both live recording and file uploads
- Multiple language options for diverse user needs
- Scalable recording interface (10-100 samples)

See also:

- [Architecture Overview](architecture.md)
- [Training Pipeline](training-pipeline.md)
- [Data Flow](data-flow.md)
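
### Example: JSONL Dataset Sketch

The JSONL format described under Dataset Creation (one object per line with `audio_path` and `text` fields, stored under `datasets/voxtral_user/`) can be sketched roughly as follows. This is an illustration only, not the code in `interface.py`: the function names, the `data.jsonl` filename, and the choice to skip rows with empty transcripts are all assumptions made for the example.

```python
import json
from pathlib import Path


def write_jsonl_manifest(pairs, dataset_dir="datasets/voxtral_user", filename="data.jsonl"):
    """Write one {"audio_path", "text"} object per line; return the manifest path.

    `pairs` is an iterable of (audio_path, transcript) tuples.
    The filename "data.jsonl" is a hypothetical default, not taken from the project.
    """
    out_dir = Path(dataset_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / filename
    with manifest.open("w", encoding="utf-8") as fh:
        for audio_path, text in pairs:
            text = text.strip()
            if not text:
                continue  # assumption: rows without a transcript are dropped
            record = {"audio_path": str(audio_path), "text": text}
            # ensure_ascii=False keeps accented characters readable in the file
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return manifest


def read_jsonl_manifest(manifest):
    """Load the manifest back into a list of dicts, ignoring blank lines."""
    with Path(manifest).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Writing one JSON object per line keeps the file appendable and lets downstream tools read it row by row without loading the whole dataset.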