
Voxtral ASR Fine-tuning Documentation

graph TD
    %% Main Entry Point
    START([🎯 Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[🏗️ Architecture Overview]
    OVERVIEW --> WORKFLOW[🔄 Interface Workflow]
    OVERVIEW --> TRAINING[🚀 Training Pipeline]
    OVERVIEW --> DEPLOYMENT[🌐 Deployment Pipeline]
    OVERVIEW --> DATAFLOW[📊 Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG[High-level Architecture<br/>System Components & Layers]
    ARCH --> ARCH_LINK[📄 View Details →](architecture.md)

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG[User Journey<br/>Recording → Training → Demo]
    WORKFLOW --> WORKFLOW_LINK[📄 View Details →](interface-workflow.md)

    %% Training Section
    TRAINING --> TRAINING_DIAG[Training Scripts<br/>Data → Model → Results]
    TRAINING --> TRAINING_LINK[📄 View Details →](training-pipeline.md)

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG[Publishing & Demo<br/>Model → Hub → Space]
    DEPLOYMENT --> DEPLOYMENT_LINK[📄 View Details →](deployment-pipeline.md)

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG[Complete Data Journey<br/>Input → Processing → Output]
    DATAFLOW --> DATAFLOW_LINK[📄 View Details →](data-flow.md)

    %% Key Components Highlight
    subgraph "πŸŽ›οΈ Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS[scripts/train*.py<br/>Training Scripts]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "πŸ“ Key Data Formats"
        JSONL[JSONL Dataset<br/>{"audio_path": "...", "text": "..."}]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT

    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data

Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

🎯 What is Voxtral ASR Fine-tuning?

Voxtral is an Automatic Speech Recognition (ASR) model from Mistral AI that can be fine-tuned for specific tasks and languages. This application provides:

  • 🎙️ Easy Data Collection: Record audio or upload files with transcripts
  • 🚀 One-Click Training: Fine-tune Voxtral with LoRA or full parameter updates
  • 🌐 Instant Deployment: Deploy interactive demos to Hugging Face Spaces
  • 📊 Experiment Tracking: Monitor training progress with Trackio integration

📚 Documentation Overview

πŸ—οΈ Architecture Overview

High-level view of system components and their relationships:

  • User Interface Layer: Gradio web interface
  • Data Processing Layer: Audio processing and dataset creation
  • Training Layer: Full and LoRA fine-tuning scripts
  • Model Management Layer: HF Hub integration and model cards
  • Deployment Layer: Demo space deployment

🔄 Interface Workflow

Complete user journey through the application:

  • Language Selection: Choose from 25+ languages via NVIDIA Granary
  • Data Collection: Record audio or upload existing files
  • Dataset Creation: Process audio + transcripts into JSONL format
  • Training Configuration: Set hyperparameters and options
  • Live Training: Real-time progress monitoring
  • Auto Deployment: One-click model publishing and demo creation
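
To make the Dataset Creation step above concrete, here is a minimal sketch of how audio/transcript pairs can be written into the JSONL format the trainer expects; the file names are placeholders, not the application's actual paths.

```python
import json
from pathlib import Path

def write_jsonl(pairs, out_path="dataset.jsonl"):
    """Write (audio_path, transcript) pairs as one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as f:
        for audio_path, text in pairs:
            record = {"audio_path": str(Path(audio_path)), "text": text.strip()}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

write_jsonl([("recordings/sample_000.wav", "hello world")])
```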

🚀 Training Pipeline

Detailed training process and script interactions:

  • Data Sources: JSONL datasets, HF Hub datasets, NVIDIA Granary
  • Data Processing: Audio resampling, text tokenization, data collation
  • Training Scripts: train.py (full) vs train_lora.py (parameter-efficient)
  • Infrastructure: Trackio logging, Hugging Face Trainer, device management
  • Model Outputs: Trained models, training logs, checkpoints
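
As an illustration of the audio-resampling step listed above (not the project's exact code), the snippet below loads a clip with torchaudio, resamples it, and downmixes to mono before tokenization; the 16 kHz target rate is an assumption.

```python
import torchaudio

def load_resampled(path, target_sr=16000):
    waveform, sr = torchaudio.load(path)  # (channels, samples)
    if sr != target_sr:
        waveform = torchaudio.functional.resample(waveform, sr, target_sr)
    return waveform.mean(dim=0), target_sr  # mono tensor, sample rate
```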

🌐 Deployment Pipeline

Model publishing and demo deployment process:

  • Model Publishing: Push to Hugging Face Hub with metadata
  • Model Card Generation: Automated documentation creation
  • Demo Space Deployment: Create interactive demos on HF Spaces
  • Configuration Management: Environment variables and secrets
  • Live Demo Features: Real-time ASR inference interface
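
A minimal sketch of the Model Publishing step using the huggingface_hub client; the repo id and output folder are placeholders, and scripts/push_to_huggingface.py may wrap additional logic such as model card generation.

```python
from huggingface_hub import HfApi

api = HfApi()  # authenticates via the HF_TOKEN environment variable
api.create_repo("username/model-name", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="outputs/voxtral-finetuned",  # assumed local output directory
    repo_id="username/model-name",
    repo_type="model",
)
```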

📊 Data Flow

Complete data journey through the system:

  • Input Sources: Microphone recordings, file uploads, external datasets
  • Processing Pipeline: Audio resampling, text cleaning, JSONL conversion
  • Training Flow: Dataset loading, batching, model training
  • Output Pipeline: Model files, logs, checkpoints, published assets
  • External Integration: HF Hub, NVIDIA Granary, Trackio Spaces
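
To illustrate the training-flow portion of this journey, here is one way the JSONL dataset could be loaded and its audio column decoded with the datasets library; the file name and 16 kHz sampling rate are assumptions.

```python
from datasets import load_dataset, Audio

ds = load_dataset("json", data_files="dataset.jsonl", split="train")
ds = ds.rename_column("audio_path", "audio")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # decodes files on access
print(ds[0]["text"], ds[0]["audio"]["array"].shape)
```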

πŸ› οΈ Core Components

| Component | Purpose | Key Features |
| --- | --- | --- |
| interface.py | Main web application | Gradio UI, data collection, training orchestration |
| scripts/train.py | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| scripts/train_lora.py | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| scripts/deploy_demo_space.py | Demo deployment | Automated HF Spaces creation and configuration |
| scripts/push_to_huggingface.py | Model publishing | HF Hub integration, model card generation |
| scripts/generate_model_card.py | Documentation | Automated model card creation from templates |

πŸ“ Key Data Formats

JSONL Dataset Format

{"audio_path": "path/to/audio.wav", "text": "transcription text"}

Training Configuration

{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
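
One plausible way these keys map onto code is a small dataclass whose fields mirror the JSON above; the loader and file name shown here are illustrative, not the application's actual implementation.

```python
import json
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_checkpoint: str = "mistralai/Voxtral-Mini-3B-2507"
    batch_size: int = 2
    learning_rate: float = 5e-5
    epochs: int = 3
    lora_r: int = 8
    lora_alpha: int = 32

with open("train_config.json", encoding="utf-8") as f:  # hypothetical file name
    config = TrainConfig(**json.load(f))
```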

Model Repository Structure

username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
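
Once published, the whole repository can be pulled back locally, for example with snapshot_download (the repo id is a placeholder):

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="username/model-name")
print(local_dir)  # contains model.safetensors, config.json, tokenizer.json, README.md, ...
```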

🚀 Quick Start

  1. Set Environment Variables:

    export HF_TOKEN=your_huggingface_token
    export HF_USERNAME=your_username
    
  2. Launch Interface:

    python interface.py
    
  3. Follow the Workflow:

    • Select language → Record/upload data → Configure training → Start training
    • Monitor progress → View results → Deploy demo

📋 Prerequisites

  • Hardware: NVIDIA GPU recommended for training
  • Software: Python 3.8+, CUDA-compatible GPU drivers
  • Tokens: Hugging Face token for model access and publishing
  • Storage: Sufficient disk space for models and datasets
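
A quick check for the hardware prerequisite, using PyTorch (which the training scripts already depend on):

```python
import torch

if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("No CUDA GPU detected; training will be slow or may run out of memory.")
```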

🔧 Configuration Options

Training Modes

  • LoRA Fine-tuning: Efficient, fast, lower memory usage
  • Full Fine-tuning: Maximum accuracy, higher memory requirements
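
As a hedged sketch of how the LoRA mode might be set up with the peft library, reusing the r/alpha values from the configuration example above; the dropout value and target modules are assumptions that depend on the Voxtral architecture.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,                     # illustrative value
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
)
# model = get_peft_model(base_model, lora_config)  # wraps base_model with trainable adapters
```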

Data Sources

  • User Recordings: Live microphone input
  • File Uploads: Existing WAV/FLAC files
  • NVIDIA Granary: High-quality multilingual datasets
  • HF Hub Datasets: Community-contributed datasets
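
External Hub datasets can also be streamed instead of downloaded in full; the dataset id below is a placeholder, not a confirmed Granary identifier.

```python
from datasets import load_dataset

external = load_dataset("username/some-asr-dataset", split="train", streaming=True)
for example in external.take(1):  # inspect a single example without a full download
    print(example.keys())
```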

Deployment Options

  • HF Hub Publishing: Share models publicly
  • Demo Spaces: Interactive web demos
  • Model Cards: Automated documentation

📈 Performance & Metrics

Training Metrics

  • Loss Curves: Training and validation loss
  • Perplexity: Model confidence measure
  • Word Error Rate: ASR accuracy (if available)
  • Training Time: Time to convergence
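
Word Error Rate can be computed from reference and predicted transcripts, for example with the jiwer package (shown as an illustrative check, not the pipeline's built-in metric):

```python
import jiwer

references = ["the quick brown fox"]
predictions = ["the quick brown box"]
print("WER:", jiwer.wer(references, predictions))  # 0.25 for this pair
```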

Resource Usage

  • GPU Memory: Peak memory usage during training
  • Training Time: Hours/days depending on dataset size
  • Model Size: Disk space requirements
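
Peak GPU memory can be read from PyTorch's allocator statistics around a training run (this assumes a CUDA device is available):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run training steps here ...
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```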

🤝 Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

  • architecture.md: System overview and component relationships
  • interface-workflow.md: User experience and interaction flow
  • training-pipeline.md: Technical training process details
  • deployment-pipeline.md: Publishing and deployment mechanics
  • data-flow.md: Data movement and transformation

📄 Additional Resources


This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.