Interface Workflow

```mermaid
stateDiagram-v2
    [*] --> LanguageSelection: User opens interface

    state "Language & Dataset Setup" as LangSetup {
        [*] --> LanguageSelection
        LanguageSelection --> LoadPhrases: Select language
        LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary
        DisplayPhrases --> RecordingInterface: Show phrases & recording UI

        state RecordingInterface {
            [*] --> ShowInitialRows: Display first 10 phrases
            ShowInitialRows --> RecordAudio: User can record audio
            RecordAudio --> AddMoreRows: Optional - add 10 more rows
            AddMoreRows --> RecordAudio
        }
    }

    RecordingInterface --> DatasetCreation: User finishes recording

    state "Dataset Creation Options" as DatasetCreation {
        [*] --> FromRecordings: Create from recorded audio
        [*] --> FromUploads: Upload existing files

        FromRecordings --> ProcessRecordings: Save WAV files + transcripts
        FromUploads --> ProcessUploads: Process uploaded files + transcripts

        ProcessRecordings --> CreateJSONL: Generate JSONL dataset
        ProcessUploads --> CreateJSONL

        CreateJSONL --> DatasetReady: Dataset saved locally
    }

    DatasetCreation --> TrainingConfiguration: Dataset ready

    state "Training Setup" as TrainingConfiguration {
        [*] --> BasicSettings: Model, LoRA/full, batch size
        [*] --> AdvancedSettings: Learning rate, epochs, LoRA params

        BasicSettings --> ConfigureDeployment: Repo name, push options
        AdvancedSettings --> ConfigureDeployment

        ConfigureDeployment --> StartTraining: All settings configured
    }

    TrainingConfiguration --> TrainingProcess: Start training

    state "Training Process" as TrainingProcess {
        [*] --> InitializeTrackio: Setup experiment tracking
        InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py
        RunTrainingScript --> StreamLogs: Show real-time training logs
        StreamLogs --> MonitorProgress: Track metrics & checkpoints

        MonitorProgress --> TrainingComplete: Training finished
        MonitorProgress --> HandleErrors: Training failed
        HandleErrors --> RetryOrExit: User can retry or exit
    }

    TrainingProcess --> PostTraining: Training complete

    state "Post-Training Actions" as PostTraining {
        [*] --> PushToHub: Push model to HF Hub
        [*] --> GenerateModelCard: Create model card
        [*] --> DeployDemoSpace: Deploy interactive demo

        PushToHub --> ModelPublished: Model available on HF Hub
        GenerateModelCard --> ModelDocumented: Model card created
        DeployDemoSpace --> DemoReady: Demo space deployed
    }

    PostTraining --> [*]: Process complete

    %% Alternative paths
    DatasetCreation --> PushDatasetOnly: Skip training, push dataset only
    PushDatasetOnly --> DatasetPublished: Dataset on HF Hub

    %% Error handling
    TrainingProcess --> ErrorRecovery: Handle training errors
    ErrorRecovery --> RetryTraining: Retry with different settings
    RetryTraining --> TrainingConfiguration

    %% Styling and notes
    note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset
    note right of RecordingInterface : Users record themselves<br/>reading displayed phrases
    note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."}
    note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc.
    note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration
    note right of PostTraining : Automated deployment<br/>pipeline
```

Interface Workflow Overview

This diagram illustrates the complete user journey through the Voxtral ASR Fine-tuning interface. The workflow is designed to be intuitive, guiding users through each step of the fine-tuning process.

Key Workflow Stages

1. Language & Dataset Setup

  • Language Selection: Users choose from 25+ European languages supported by NVIDIA Granary
  • Phrase Loading: System loads authentic, high-quality phrases in the selected language (see the sketch after this list)
  • Recording Interface: Dynamic interface showing phrases with audio recording components
  • Progressive Disclosure: Users can add more rows as needed (up to 100 recordings)
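
As a rough illustration of the phrase-loading step, the snippet below streams a few transcripts for the selected language with the `datasets` library. The `nvidia/Granary` repo id, the per-language config, and the column names are assumptions made for this sketch, not code taken from `interface.py`.

```python
# Hedged sketch of the phrase-loading step. The repo id, per-language config,
# and column names below are assumptions for illustration only.
from datasets import load_dataset

def load_phrases(language: str, n: int = 10) -> list[str]:
    """Stream a handful of transcripts for the selected language."""
    ds = load_dataset(
        "nvidia/Granary",   # assumed Hub repo id for the Granary corpus
        name=language,      # assumed per-language configuration
        split="train",
        streaming=True,     # avoid downloading the full corpus up front
    )
    phrases: list[str] = []
    for example in ds:
        text = example.get("text") or example.get("transcript") or ""
        if text:
            phrases.append(text)
        if len(phrases) >= n:
            break
    return phrases
```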

2. Dataset Creation

  • From Recordings: Process microphone recordings into WAV files and JSONL dataset
  • From Uploads: Handle existing WAV/FLAC files with manual transcripts
  • JSONL Format: Standard format with audio_path and text fields (example below)
  • Local Storage: Datasets are stored in the datasets/voxtral_user/ directory
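
The JSONL export itself is straightforward. A minimal sketch is shown below; the `dataset.jsonl` filename is an assumption, while the directory matches the documented `datasets/voxtral_user/` layout.

```python
# Minimal sketch of the JSONL export: one {"audio_path", "text"} record per line.
import json
from pathlib import Path

def write_jsonl(records: list[dict], out_dir: str = "datasets/voxtral_user") -> Path:
    """Write (audio_path, text) pairs to a JSONL file and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    jsonl_path = out / "dataset.jsonl"   # assumed filename
    with jsonl_path.open("w", encoding="utf-8") as f:
        for rec in records:
            row = {"audio_path": rec["audio_path"], "text": rec["text"]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return jsonl_path

# Example usage:
# write_jsonl([{"audio_path": "datasets/voxtral_user/wavs/sample_000.wav",
#               "text": "An example phrase read aloud."}])
```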

3. Training Configuration

  • Basic Settings: Model selection, LoRA vs full fine-tuning, batch size
  • Advanced Settings: Learning rate, epochs, gradient accumulation
  • LoRA Parameters: r, alpha, dropout, audio tower freezing options (sketched below)
  • Repository Setup: Model naming and Hugging Face Hub integration
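
For orientation, the LoRA settings above map naturally onto a PEFT `LoraConfig`. The sketch below is illustrative only: the target modules and the `audio_tower` parameter-name check are assumptions, and `train_lora.py` defines the actual configuration.

```python
# Illustrative mapping of the UI's LoRA settings onto a PEFT LoraConfig.
# The target modules and the "audio_tower" name check are assumptions.
from peft import LoraConfig

def build_lora_config(r: int = 16, alpha: int = 32, dropout: float = 0.05) -> LoraConfig:
    return LoraConfig(
        r=r,                  # LoRA rank from the advanced settings
        lora_alpha=alpha,     # scaling factor
        lora_dropout=dropout,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projections
        task_type="CAUSAL_LM",
    )

def freeze_audio_tower(model) -> None:
    """Freeze the audio encoder; the parameter-name prefix is an assumption."""
    for name, param in model.named_parameters():
        if "audio_tower" in name:
            param.requires_grad = False
```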

4. Training Process

  • Trackio Integration: Automatic experiment tracking setup
  • Script Execution: Calls appropriate training script (train.py or train_lora.py)
  • Log Streaming: Real-time display of training progress and metrics (see the sketch below)
  • Error Handling: Graceful handling of training failures with retry options
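
A minimal sketch of the log-streaming step, assuming the training script is launched as a subprocess and its merged output is yielded line by line to the UI:

```python
# Minimal sketch of log streaming: run the training script as a subprocess and
# yield its merged stdout/stderr line by line for display in the UI.
import subprocess
import sys

def stream_training_logs(cmd: list[str]):
    """Yield training output as it is produced, then the exit code."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,                 # line-buffered
    )
    for line in proc.stdout:
        yield line.rstrip()
    proc.wait()
    yield f"[training exited with code {proc.returncode}]"

# Example usage (flags omitted; pass whatever the configuration step produced):
# for line in stream_training_logs([sys.executable, "train_lora.py"]):
#     print(line)
```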

5. Post-Training Actions

  • Model Publishing: Automatic push to Hugging Face Hub (sketched below)
  • Model Card Generation: Automated creation using generate_model_card.py
  • Demo Deployment: One-click deployment of interactive demo spaces
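
A hedged sketch of the publishing step using `huggingface_hub`; the repo id and output directory are placeholders, and `push_to_huggingface.py` implements the real flow, including model card generation.

```python
# Hedged sketch of the model-publishing step with huggingface_hub.
# The repo id and output directory are placeholders.
from huggingface_hub import HfApi

def push_model(local_dir: str, repo_id: str) -> str:
    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model")
    return f"https://huggingface.co/{repo_id}"

# Example usage:
# push_model("outputs/voxtral-finetuned", "your-username/voxtral-asr-lora")
```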

Alternative Paths

Dataset-Only Workflow

  • Users can create and publish datasets without training models (see the sketch below)
  • Useful for dataset curation and sharing
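
One way the dataset-only path could look, assuming the JSONL file produced earlier; the repo id is a placeholder, and the audio files referenced by audio_path would also need to be uploaded (not shown here).

```python
# Sketch of the dataset-only path, assuming the JSONL file produced earlier.
from datasets import load_dataset

def publish_dataset(jsonl_path: str, repo_id: str) -> None:
    ds = load_dataset("json", data_files=jsonl_path, split="train")
    ds.push_to_hub(repo_id)  # repo id is a placeholder

# Example usage:
# publish_dataset("datasets/voxtral_user/dataset.jsonl", "your-username/voxtral-dataset")
```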

Error Recovery

  • Training failures trigger error recovery flows
  • Users can retry with modified parameters (illustrated below)
  • Comprehensive error logging and debugging information
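
An illustrative error-recovery wrapper, not the actual implementation: run the training command, and on failure surface a short log tail so the user can adjust settings and retry.

```python
# Illustrative error-recovery wrapper around the training command.
import subprocess

def run_training(cmd: list[str], log_tail: int = 20) -> tuple[bool, str]:
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return True, "Training finished successfully."
    # Keep only the last few lines so the UI can show a readable error summary.
    tail = "\n".join((result.stdout + result.stderr).splitlines()[-log_tail:])
    return False, f"Training failed (exit code {result.returncode}). Last lines:\n{tail}"
```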

Technical Integration Points

External Services

  • NVIDIA Granary: Source of high-quality multilingual ASR data
  • Hugging Face Hub: Model and dataset storage and sharing
  • Trackio Spaces: Experiment tracking and visualization (sketched below)
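
The Trackio integration is sketched below, assuming its wandb-style `init` / `log` / `finish` API; the project name and metric values are placeholders, and the training scripts own the real instrumentation.

```python
# Hedged sketch of Trackio usage, assuming its wandb-style init/log/finish API.
import trackio

def track_run(num_epochs: int) -> None:
    trackio.init(project="voxtral-asr-finetune")  # placeholder project name
    for epoch in range(num_epochs):
        # ... a real training loop would compute metrics here ...
        trackio.log({"epoch": epoch, "train/loss": 0.0})  # placeholder value
    trackio.finish()
```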

Script Integration

  • interface.py: Main Gradio application orchestrating the workflow
  • train.py/train_lora.py: Core training scripts with Trackio integration
  • push_to_huggingface.py: Model/dataset publishing
  • deploy_demo_space.py: Automated demo deployment
  • generate_model_card.py: Model documentation generation

User Experience Features

Progressive Interface Reveal

  • Interface components are revealed as users progress through the workflow (sketched below)
  • Reduces cognitive load and guides users step-by-step
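
A minimal Gradio sketch of this pattern: pre-build hidden rows and reveal the next batch of ten when the user asks for more. Component choices are illustrative, not copied from `interface.py`.

```python
# Minimal Gradio sketch of progressive disclosure: pre-build hidden rows and
# reveal the next batch of ten on a button click.
import gradio as gr

MAX_ROWS, STEP = 100, 10

with gr.Blocks() as demo:
    rows, texts, audios = [], [], []
    for i in range(MAX_ROWS):
        with gr.Row(visible=(i < STEP)) as row:
            texts.append(gr.Textbox(label=f"Phrase {i + 1}"))
            audios.append(gr.Audio(sources=["microphone"], type="filepath"))
        rows.append(row)

    visible_count = gr.State(STEP)
    more_btn = gr.Button("Add 10 more rows")

    def reveal(count: int):
        new_count = min(count + STEP, MAX_ROWS)
        updates = [gr.update(visible=(i < new_count)) for i in range(MAX_ROWS)]
        return [new_count] + updates

    more_btn.click(reveal, inputs=visible_count, outputs=[visible_count] + rows)

# demo.launch()
```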

Real-time Feedback

  • Live log streaming during training
  • Progress indicators and status updates
  • Immediate feedback on dataset creation and validation

Flexible Input Methods

  • Support for both live recording and file uploads
  • Multiple language options for diverse user needs
  • Scalable recording interface (10-100 samples)

See also: