# Voxtral ASR Fine-tuning Documentation
```mermaid
graph TD
    %% Main Entry Point
    START([Voxtral ASR Fine-tuning App]) --> OVERVIEW{Choose Documentation}

    %% Documentation Categories
    OVERVIEW --> ARCH[Architecture Overview]
    OVERVIEW --> WORKFLOW[Interface Workflow]
    OVERVIEW --> TRAINING[Training Pipeline]
    OVERVIEW --> DEPLOYMENT[Deployment Pipeline]
    OVERVIEW --> DATAFLOW[Data Flow]

    %% Architecture Section
    ARCH --> ARCH_DIAG["High-level Architecture<br/>System Components & Layers"]
    ARCH --> ARCH_LINK["View Details → architecture.md"]

    %% Interface Section
    WORKFLOW --> WORKFLOW_DIAG["User Journey<br/>Recording → Training → Demo"]
    WORKFLOW --> WORKFLOW_LINK["View Details → interface-workflow.md"]

    %% Training Section
    TRAINING --> TRAINING_DIAG["Training Scripts<br/>Data → Model → Results"]
    TRAINING --> TRAINING_LINK["View Details → training-pipeline.md"]

    %% Deployment Section
    DEPLOYMENT --> DEPLOYMENT_DIAG["Publishing & Demo<br/>Model → Hub → Space"]
    DEPLOYMENT --> DEPLOYMENT_LINK["View Details → deployment-pipeline.md"]

    %% Data Flow Section
    DATAFLOW --> DATAFLOW_DIAG["Complete Data Journey<br/>Input → Processing → Output"]
    DATAFLOW --> DATAFLOW_LINK["View Details → data-flow.md"]

    %% Document Links
    click ARCH_LINK "architecture.md"
    click WORKFLOW_LINK "interface-workflow.md"
    click TRAINING_LINK "training-pipeline.md"
    click DEPLOYMENT_LINK "deployment-pipeline.md"
    click DATAFLOW_LINK "data-flow.md"

    %% Key Components Highlight
    subgraph "Core Components"
        INTERFACE[interface.py<br/>Gradio Web UI]
        TRAIN_SCRIPTS["scripts/train*.py<br/>Training Scripts"]
        DEPLOY_SCRIPT[scripts/deploy_demo_space.py<br/>Demo Deployment]
        PUSH_SCRIPT[scripts/push_to_huggingface.py<br/>Model Publishing]
    end

    %% Data Flow Highlight
    subgraph "Key Data Formats"
        JSONL["JSONL Dataset<br/>{audio_path, text}"]
        HFDATA[HF Hub Models<br/>username/model-name]
        SPACES[HF Spaces<br/>Interactive Demos]
    end

    %% Connect components to their respective docs
    INTERFACE --> WORKFLOW
    TRAIN_SCRIPTS --> TRAINING
    DEPLOY_SCRIPT --> DEPLOYMENT
    PUSH_SCRIPT --> DEPLOYMENT
    JSONL --> DATAFLOW
    HFDATA --> DEPLOYMENT
    SPACES --> DEPLOYMENT

    %% Styling
    classDef entry fill:#e3f2fd,stroke:#1976d2,stroke-width:3px
    classDef category fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef diagram fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef link fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef component fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef data fill:#e1f5fe,stroke:#0277bd,stroke-width:2px

    class START entry
    class OVERVIEW,ARCH,WORKFLOW,TRAINING,DEPLOYMENT,DATAFLOW category
    class ARCH_DIAG,WORKFLOW_DIAG,TRAINING_DIAG,DEPLOYMENT_DIAG,DATAFLOW_DIAG diagram
    class ARCH_LINK,WORKFLOW_LINK,TRAINING_LINK,DEPLOYMENT_LINK,DATAFLOW_LINK link
    class INTERFACE,TRAIN_SCRIPTS,DEPLOY_SCRIPT,PUSH_SCRIPT component
    class JSONL,HFDATA,SPACES data
```
## Voxtral ASR Fine-tuning Application

This documentation provides comprehensive diagrams and explanations of the Voxtral ASR Fine-tuning application architecture and workflows.

### What is Voxtral ASR Fine-tuning?

Voxtral is an Automatic Speech Recognition (ASR) model that can be fine-tuned for specific tasks and languages. This application provides:

- **Easy Data Collection**: Record audio or upload files with transcripts
- **One-Click Training**: Fine-tune Voxtral with LoRA or full parameter updates
- **Instant Deployment**: Deploy interactive demos to Hugging Face Spaces
- **Experiment Tracking**: Monitor training progress with Trackio integration

### Documentation Overview

#### [Architecture Overview](architecture.md)
High-level view of system components and their relationships:
- **User Interface Layer**: Gradio web interface
- **Data Processing Layer**: Audio processing and dataset creation
- **Training Layer**: Full and LoRA fine-tuning scripts
- **Model Management Layer**: HF Hub integration and model cards
- **Deployment Layer**: Demo space deployment

#### [Interface Workflow](interface-workflow.md)
Complete user journey through the application:
- **Language Selection**: Choose from 25+ languages via NVIDIA Granary
- **Data Collection**: Record audio or upload existing files (see the sketch after this list)
- **Dataset Creation**: Process audio and transcripts into JSONL format
- **Training Configuration**: Set hyperparameters and options
- **Live Training**: Real-time progress monitoring
- **Auto Deployment**: One-click model publishing and demo creation
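The data-collection step can be pictured as a small Gradio app: one microphone input, one transcript box, and each submission appended to a JSONL file. This is a minimal sketch assuming the Gradio 4.x API, not the real `interface.py`:

```python
# Minimal sketch of the record-and-transcribe step (not the real interface.py).
import json

import gradio as gr


def add_sample(audio_path: str, transcript: str) -> str:
    """Append one (audio, text) pair to the local JSONL dataset."""
    with open("dataset.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps({"audio_path": audio_path, "text": transcript}) + "\n")
    return f"Saved: {audio_path}"


demo = gr.Interface(
    fn=add_sample,
    inputs=[
        gr.Audio(sources=["microphone"], type="filepath", label="Recording"),
        gr.Textbox(label="Transcript"),
    ],
    outputs="text",
)

if __name__ == "__main__":
    demo.launch()
```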
#### [Training Pipeline](training-pipeline.md)
Detailed training process and script interactions:
- **Data Sources**: JSONL datasets, HF Hub datasets, NVIDIA Granary
- **Data Processing**: Audio resampling, text tokenization, data collation
- **Training Scripts**: `train.py` (full fine-tuning) vs. `train_lora.py` (parameter-efficient)
- **Infrastructure**: Trackio logging, Hugging Face Trainer, device management
- **Model Outputs**: Trained models, training logs, checkpoints

#### [Deployment Pipeline](deployment-pipeline.md)
Model publishing and demo deployment process:
- **Model Publishing**: Push to Hugging Face Hub with metadata
- **Model Card Generation**: Automated documentation creation
- **Demo Space Deployment**: Create interactive demos on HF Spaces
- **Configuration Management**: Environment variables and secrets
- **Live Demo Features**: Real-time ASR inference interface

#### [Data Flow](data-flow.md)
Complete data journey through the system:
- **Input Sources**: Microphone recordings, file uploads, external datasets
- **Processing Pipeline**: Audio resampling, text cleaning, JSONL conversion (resampling is sketched after this list)
- **Training Flow**: Dataset loading, batching, model training
- **Output Pipeline**: Model files, logs, checkpoints, published assets
- **External Integration**: HF Hub, NVIDIA Granary, Trackio Spaces
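Resampling is the first transformation most samples go through. A minimal sketch with `torchaudio`, assuming a 16 kHz target rate (check the training scripts for the rate they actually use):

```python
# Sketch of the resampling step: load an audio file, resample to the target
# rate if needed, and downmix to mono. The 16 kHz target is an assumption.
import torchaudio
import torchaudio.functional as F


def load_resampled(path: str, target_sr: int = 16000):
    """Load an audio file and return a mono waveform at target_sr."""
    waveform, sr = torchaudio.load(path)  # shape: (channels, samples)
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    return waveform.mean(dim=0), target_sr  # downmix to mono
```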
### Core Components

| Component | Purpose | Key Features |
|-----------|---------|--------------|
| `interface.py` | Main web application | Gradio UI, data collection, training orchestration |
| `scripts/train.py` | Full model fine-tuning | Complete parameter updates, maximum accuracy |
| `scripts/train_lora.py` | LoRA fine-tuning | Parameter-efficient, faster training, lower memory |
| `scripts/deploy_demo_space.py` | Demo deployment | Automated HF Spaces creation and configuration |
| `scripts/push_to_huggingface.py` | Model publishing | HF Hub integration, model card generation |
| `scripts/generate_model_card.py` | Documentation | Automated model card creation from templates |

### Key Data Formats

#### JSONL Dataset Format

```json
{"audio_path": "path/to/audio.wav", "text": "transcription text"}
```
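Since JSONL stores one JSON object per line, reading a dataset back is a short loop; a minimal sketch:

```python
# Read a JSONL dataset into a list of dicts, skipping blank lines.
import json


def read_jsonl(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


samples = read_jsonl("dataset.jsonl")
# samples[0] -> {"audio_path": "path/to/audio.wav", "text": "transcription text"}
```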
#### Training Configuration

```json
{
  "model_checkpoint": "mistralai/Voxtral-Mini-3B-2507",
  "batch_size": 2,
  "learning_rate": 5e-5,
  "epochs": 3,
  "lora_r": 8,
  "lora_alpha": 32
}
```
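A config like this is typically loaded and forwarded as keyword arguments to a training entry point. A hypothetical sketch; `run_training` is a stand-in, not a function this repo necessarily exposes:

```python
# Hypothetical sketch: load the JSON config and pass it to a training
# entry point. run_training is a stand-in for the real script's main().
import json


def run_training(**kwargs) -> None:
    """Stand-in for the real entry point (e.g. scripts/train_lora.py)."""
    print("Would launch training with:", kwargs)


with open("train_config.json", encoding="utf-8") as f:
    cfg = json.load(f)

run_training(**cfg)
```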
#### Model Repository Structure

```
username/model-name/
├── model.safetensors
├── config.json
├── tokenizer.json
├── README.md (model card)
└── training_results/
```
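Publishing a directory with this layout takes two `huggingface_hub` calls; a sketch with placeholder repo and folder names (the repo's `scripts/push_to_huggingface.py` adds metadata and the model card on top of this):

```python
# Sketch of pushing a trained model directory to the Hugging Face Hub.
# Repo id and folder path are placeholders.
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.create_repo("username/model-name", exist_ok=True)
api.upload_folder(folder_path="outputs/model", repo_id="username/model-name")
```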
### Quick Start

1. **Set Environment Variables**:
   ```bash
   export HF_TOKEN=your_huggingface_token
   export HF_USERNAME=your_username
   ```
2. **Launch the Interface**:
   ```bash
   python interface.py
   ```
3. **Follow the Workflow**:
   - Select language → Record/upload data → Configure training → Start training
   - Monitor progress → View results → Deploy demo
### Prerequisites

- **Hardware**: NVIDIA GPU recommended for training
- **Software**: Python 3.8+, CUDA-compatible GPU drivers
- **Tokens**: Hugging Face token for model access and publishing
- **Storage**: Sufficient disk space for models and datasets
### Configuration Options

#### Training Modes
- **LoRA Fine-tuning**: Efficient and fast, with lower memory usage (see the sketch after this list)
- **Full Fine-tuning**: Maximum accuracy, higher memory requirements
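What the LoRA mode amounts to in code: a sketch with `peft`, reusing the `lora_r`/`lora_alpha` values from the sample config above. The model class and `target_modules` below are assumptions; the repo's `train_lora.py` defines the real ones.

```python
# Sketch of LoRA setup with peft. target_modules and the Auto class are
# assumptions, not taken from this repo's scripts.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("mistralai/Voxtral-Mini-3B-2507")
lora_cfg = LoraConfig(r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction is trainable
```

Because only the low-rank adapter matrices receive gradients, optimizer state and activation memory shrink substantially, which is what makes this mode faster and lighter than full fine-tuning.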
#### Data Sources
- **User Recordings**: Live microphone input
- **File Uploads**: Existing WAV/FLAC files
- **NVIDIA Granary**: High-quality multilingual datasets (loading is sketched below)
- **HF Hub Datasets**: Community-contributed datasets
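Hub-hosted sources like Granary can be streamed with the `datasets` library; the config and split names below are assumptions, so check the dataset card for the real ones:

```python
# Sketch of streaming a Granary split. The "en" config and "train" split
# are assumptions; consult the dataset card for actual names.
from datasets import load_dataset

ds = load_dataset("nvidia/Granary", "en", split="train", streaming=True)
for sample in ds.take(3):
    print(sample.keys())
```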
#### Deployment Options
- **HF Hub Publishing**: Share models publicly
- **Demo Spaces**: Interactive web demos (creation is sketched after this list)
- **Model Cards**: Automated documentation
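Creating a demo Space programmatically is a few `huggingface_hub` calls; a sketch with placeholder names (the repo's `scripts/deploy_demo_space.py` handles this end to end, including configuration):

```python
# Sketch of creating a Gradio demo Space, uploading app files, and setting
# a secret. Repo id, folder, and token value are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    "username/voxtral-demo", repo_type="space", space_sdk="gradio", exist_ok=True
)
api.upload_folder(
    folder_path="demo/", repo_id="username/voxtral-demo", repo_type="space"
)
api.add_space_secret("username/voxtral-demo", "HF_TOKEN", "hf_placeholder")
```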
### Performance & Metrics

#### Training Metrics
- **Loss Curves**: Training and validation loss
- **Perplexity**: Exponential of the per-token loss; lower values mean better predictions
- **Word Error Rate**: ASR accuracy, when reference transcripts are available (computed as sketched below)
- **Training Time**: Time to convergence
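Word Error Rate can be sanity-checked with the `evaluate` library (which uses `jiwer` under the hood); a minimal example:

```python
# Compute WER for a toy prediction/reference pair with the evaluate library.
import evaluate

wer = evaluate.load("wer")
score = wer.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat sat on a mat"],
)
print(f"WER: {score:.2f}")  # 1 substitution / 6 reference words ≈ 0.17
```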
#### Resource Usage
- **GPU Memory**: Peak memory usage during training
- **Training Time**: Hours to days, depending on dataset size
- **Model Size**: Disk space requirements
### Contributing

The documentation is organized as interlinked Markdown files with Mermaid diagrams. Each diagram focuses on a specific aspect:

- **architecture.md**: System overview and component relationships
- **interface-workflow.md**: User experience and interaction flow
- **training-pipeline.md**: Technical training process details
- **deployment-pipeline.md**: Publishing and deployment mechanics
- **data-flow.md**: Data movement and transformation
### Additional Resources

- **Hugging Face Spaces**: [Live Demo](https://huggingface.co/spaces)
- **Voxtral Models**: [Model Hub](https://huggingface.co/mistralai)
- **NVIDIA Granary**: [Dataset Documentation](https://huggingface.co/nvidia/Granary)
- **Trackio**: [Experiment Tracking](https://trackio.space)

---

*This documentation was automatically generated to explain the Voxtral ASR Fine-tuning application architecture and workflows.*