# Voxtral ASR Fine-tuning Architecture

```mermaid
graph TB
    %% User Interface Layer
    subgraph "User Interface"
        UI[Gradio Web Interface<br/>interface.py]
        REC[Audio Recording<br/>Microphone Input]
        UP[File Upload<br/>WAV/FLAC files]
    end

    %% Data Processing Layer
    subgraph "Data Processing"
        DP[Data Processing<br/>Audio resampling<br/>JSONL creation]
        DS[Dataset Management<br/>NVIDIA Granary<br/>Local datasets]
    end

    %% Training Layer
    subgraph "Training Pipeline"
        TF[Full Fine-tuning<br/>scripts/train.py]
        TL[LoRA Fine-tuning<br/>scripts/train_lora.py]
        TI[Trackio Integration<br/>Experiment Tracking]
    end

    %% Model Management Layer
    subgraph "Model Management"
        MM[Model Management<br/>Hugging Face Hub<br/>Local storage]
        MC[Model Card Generation<br/>scripts/generate_model_card.py]
    end

    %% Deployment Layer
    subgraph "Deployment & Demo"
        DEP[Demo Space Deployment<br/>scripts/deploy_demo_space.py]
        HF[HF Spaces<br/>Interactive Demo]
    end

    %% External Services
    subgraph "External Services"
        HFH[Hugging Face Hub<br/>Models & Datasets]
        GRAN[NVIDIA Granary<br/>Multilingual ASR Dataset]
        TRACK[Trackio Spaces<br/>Experiment Tracking]
    end

    %% Data Flow
    UI --> DP
    REC --> DP
    UP --> DP
    DP --> DS
    DS --> TF
    DS --> TL
    TF --> TI
    TL --> TI
    TF --> MM
    TL --> MM
    MM --> MC
    MM --> DEP
    DEP --> HF
    DS -.-> HFH
    MM -.-> HFH
    TI -.-> TRACK
    DS -.-> GRAN

    %% Styling
    classDef interface fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef management fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef deployment fill:#fce4ec,stroke:#880e4f,stroke-width:2px
    classDef external fill:#f5f5f5,stroke:#424242,stroke-width:2px

    class UI,REC,UP interface
    class DP,DS processing
    class TF,TL,TI training
    class MM,MC management
    class DEP,HF deployment
    class HFH,GRAN,TRACK external
```

## Architecture Overview

This diagram shows the high-level architecture of the Voxtral ASR Fine-tuning application. The system is organized into several layers:

### 1. User Interface Layer
- **Gradio Web Interface**: Main user-facing application built with Gradio
- **Audio Recording**: Microphone input for recording speech samples
- **File Upload**: Support for uploading existing WAV/FLAC audio files

### 2. Data Processing Layer
- **Data Processing**: Audio resampling to 16 kHz, JSONL dataset creation
- **Dataset Management**: Integration with NVIDIA Granary dataset and local dataset handling

### 3. Training Layer
- **Full Fine-tuning**: Complete model fine-tuning using `scripts/train.py`
- **LoRA Fine-tuning**: Parameter-efficient fine-tuning using `scripts/train_lora.py`
- **Trackio Integration**: Experiment tracking and logging

### 4. Model Management Layer
- **Model Management**: Local storage and Hugging Face Hub integration
- **Model Card Generation**: Automated model card creation

### 5. Deployment Layer
- **Demo Space Deployment**: Automated deployment to Hugging Face Spaces
- **Interactive Demo**: Live demo interface for testing fine-tuned models

### 6. External Services
- **Hugging Face Hub**: Model and dataset storage and sharing
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **Trackio Spaces**: Experiment tracking and visualization

## Key Workflows

1. **Dataset Creation**: Users can record audio or upload files → processed into JSONL format
2. **Model Training**: Datasets fed into training scripts with experiment tracking
3. **Model Publishing**: Trained models pushed to HF Hub with generated model cards
4. **Demo Deployment**: Automated deployment of interactive demos to HF Spaces

See also:
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)
- [Deployment Pipeline](deployment-pipeline.md)
- [Data Flow](data-flow.md)
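
As an illustration of the Dataset Creation workflow (recorded or uploaded audio turned into a JSONL file, with audio resampled to 16 kHz), here is a minimal sketch using only the Python standard library. It is not the project's actual implementation: the function names and the JSONL field names `audio_path` and `text` are assumptions made for this example, and the sketch only detects whether a WAV file needs resampling rather than performing the resampling itself.

```python
import json
import wave
from pathlib import Path

# Target rate taken from the Data Processing layer description (16 kHz).
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path: Path) -> int:
    """Return the sample rate of a WAV file (hypothetical helper)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def needs_resampling(path: Path, target: int = TARGET_SAMPLE_RATE) -> bool:
    """True when the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target


def write_jsonl(records, out_path: Path) -> int:
    """Write one JSON object per line and return the number of rows.

    `records` is an iterable of (audio_path, transcript) pairs.
    The field names below are assumptions, not the project's schema.
    """
    count = 0
    with out_path.open("w", encoding="utf-8") as f:
        for audio_path, text in records:
            row = {"audio_path": str(audio_path), "text": text}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count
```

A caller would typically check `needs_resampling` for each recording, resample where required with an audio library of its choice, and then pass the resulting paths and transcripts to `write_jsonl`.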