Data Flow
flowchart TD
%% User Input Sources
subgraph "User Input"
MIC[Microphone Recording<br/>Raw audio + timestamps]
FILE[File Upload<br/>WAV/FLAC files]
TEXT[Manual Transcripts<br/>Text input]
LANG[Language Selection<br/>25+ languages]
end
%% Data Processing Pipeline
subgraph "Data Processing"
AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
JSONL_CONV[JSONL Conversion<br/>{"audio_path": "...", "text": "..."}]
end
%% Dataset Storage
subgraph "Dataset Storage"
LOCAL_DS[Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/]
HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
end
%% Training Data Flow
subgraph "Training Data Pipeline"
DS_LOADER[Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()]
AUDIO_CAST[Audio Casting<br/>Audio(sampling_rate=16000)]
TRAIN_SPLIT[Train Split<br/>train_dataset]
EVAL_SPLIT[Eval Split<br/>eval_dataset]
end
%% Model Training
subgraph "Model Training"
COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
FORWARD[Forward Pass<br/>Audio → Features → Text]
LOSS[Loss Calculation<br/>Masked LM loss]
BACKWARD[Backward Pass<br/>Gradient computation]
OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
end
%% Training Outputs
subgraph "Training Outputs"
MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
end
%% Publishing Pipeline
subgraph "Publishing Pipeline"
HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
end
%% Demo Deployment
subgraph "Demo Deployment"
SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
end
%% External Data Sources
subgraph "External Data Sources"
GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
end
%% Data Flow Connections
MIC --> AUDIO_PROC
FILE --> AUDIO_PROC
TEXT --> TEXT_PROC
LANG --> TEXT_PROC
AUDIO_PROC --> JSONL_CONV
TEXT_PROC --> JSONL_CONV
JSONL_CONV --> LOCAL_DS
LOCAL_DS --> HF_DS
LOCAL_DS --> DS_LOADER
HF_DS --> DS_LOADER
GRANARY --> DS_LOADER
HF_COMM --> DS_LOADER
DS_LOADER --> AUDIO_CAST
AUDIO_CAST --> TRAIN_SPLIT
AUDIO_CAST --> EVAL_SPLIT
TRAIN_SPLIT --> COLLATOR
EVAL_SPLIT --> COLLATOR
COLLATOR --> FORWARD
FORWARD --> LOSS
LOSS --> BACKWARD
BACKWARD --> OPTIMIZE
OPTIMIZE --> MODEL_FILES
OPTIMIZE --> TRAINING_LOGS
OPTIMIZE --> CHECKPOINTS
MODEL_FILES --> HF_REPO
TRAINING_LOGS --> HF_REPO
CHECKPOINTS --> HF_REPO
HF_REPO --> MODEL_CARD
TRAINING_LOGS --> MODEL_CARD
MODEL_CARD --> SPACE_REPO
HF_REPO --> SPACE_REPO
ENV_VARS --> SPACE_REPO
SPACE_REPO --> DEMO_APP
%% Styling
classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px
class MIC,FILE,TEXT,LANG input
class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
class LOCAL_DS,HF_DS storage
class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
class HF_REPO,MODEL_CARD,METADATA publishing
class SPACE_REPO,DEMO_APP,ENV_VARS deployment
class GRANARY,HF_COMM external
Data Flow Overview
This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.
Data Input Sources
User-Generated Data
- Microphone Recording: Raw audio captured through browser microphone
- File Upload: Existing WAV/FLAC audio files
- Manual Transcripts: User-provided text transcriptions
- Language Selection: Influences phrase selection from NVIDIA Granary
External Data Sources
- NVIDIA Granary: High-quality multilingual ASR dataset
- HF Community Datasets: Public datasets from Hugging Face Hub
Data Processing Pipeline
Audio Processing
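# Audio resampling and format conversion
import librosa
import soundfile as sf
# librosa.load returns (samples, sampling_rate); resample to 16 kHz on load
audio_data, sample_rate = librosa.load(audio_path, sr=16000)
# Convert to WAV format for consistency
sf.write(output_path, audio_data, sample_rate)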
Text Processing
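# Text cleaning and validation
text = text.strip()
# Basic validation (length, content checks); raise rather than assert,
# because assert statements are skipped when Python runs with -O
if not text:
    raise ValueError("Empty transcription")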
JSONL Conversion
import json
# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append to the JSONL file, one JSON object per line
with open(jsonl_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(entry) + "\n")
Dataset Storage
Local Storage Structure
datasets/voxtral_user/
├── data.jsonl # Main dataset file
├── recorded_data.jsonl # From recordings
└── wavs/ # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
HF Hub Storage
- Public Datasets: Shareable with community
- Version Control: Dataset versioning and updates
- Standard Metadata: Automatic README generation
Training Data Pipeline
Dataset Loading
from datasets import load_dataset
# Load local JSONL
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")
# Or load an HF dataset
ds = load_dataset("username/dataset-name", split="train")
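The internals of `_load_jsonl_dataset` are not shown here. As a rough sketch of what a JSONL loader along these lines might do, the hypothetical `read_jsonl_entries` helper below parses one JSON object per line using only the standard library. The function name and the choice to skip blank lines are assumptions, not the project's actual implementation.

```python
import json
import os
import tempfile


def read_jsonl_entries(jsonl_path):
    """Hypothetical helper: read {"audio_path", "text"} records from a JSONL file."""
    entries = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            entries.append({"audio_path": record["audio_path"], "text": record["text"]})
    return entries


# Usage: round-trip two records through a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"audio_path": "wavs/recording_0000.wav", "text": "hello"}) + "\n")
        f.write("\n")
        f.write(json.dumps({"audio_path": "wavs/recording_0001.wav", "text": "world"}) + "\n")
    loaded = read_jsonl_entries(path)
```

A list of dictionaries like this could then be handed to whatever dataset container the training code expects.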
Audio Casting
from datasets import Audio
# Ensure consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
Train/Eval Split
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
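The snippet above uses `train_count` and `eval_count` without showing where they come from. One plausible way to derive them, sketched here under the assumption of a fixed evaluation fraction (the `split_counts` helper and its 10% default are hypothetical), is:

```python
def split_counts(total, eval_fraction=0.1):
    """Hypothetical helper: derive (train_count, eval_count) from a dataset size."""
    if total < 2:
        raise ValueError("need at least 2 examples to split")
    # Always hold out at least one example for evaluation
    eval_count = max(1, int(total * eval_fraction))
    train_count = total - eval_count
    return train_count, eval_count


train_count, eval_count = split_counts(100, eval_fraction=0.1)
# These index ranges mirror the ds.select(...) calls above
train_indices = range(train_count)
eval_indices = range(train_count, train_count + eval_count)
```

Because the two ranges are contiguous and non-overlapping, no example ends up in both splits. Shuffling the dataset beforehand would be needed if the file order is not already random.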
Training Process Flow
Data Collation
- VoxtralDataCollator: Custom collator for Voxtral model
- Audio Processing: Convert audio to model inputs
- Prompt Construction: Build [AUDIO]...[AUDIO] <transcribe> prompts
- Text Tokenization: Process transcription targets
- Masking: Mask audio prompt tokens during training
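The masking step can be illustrated with plain lists. This is a minimal sketch, not the actual `VoxtralDataCollator`: it assumes the common convention of setting prompt positions in the labels to -100, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The `build_labels` helper and the token IDs are made up for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target tokens; hide the prompt from the loss."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Token IDs below are made-up placeholders, not real Voxtral vocabulary entries
input_ids, labels = build_labels(prompt_ids=[7, 7, 7, 9], target_ids=[42, 43, 2])
```

With labels built this way, only the transcription tokens contribute to the loss, so the model is not trained to reproduce the audio prompt itself.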
Forward Pass
- Audio Input: Raw audio waveforms
- Audio Tower: Extract audio features
- Language Model: Generate transcription autoregressively
- Loss Calculation: Compare predicted tokens against the target text (masked LM loss)
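To make the loss calculation step concrete, here is a small sketch of a masked token-level loss. It assumes the usual cross-entropy formulation for causal language models, in which the loss is the mean negative log-likelihood over unmasked target positions; the source does not spell out the exact formula, and `masked_nll` is a hypothetical name.

```python
import math

IGNORE_INDEX = -100  # marks positions excluded from the loss (assumed convention)


def masked_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored.

    token_probs[i] is the probability the model assigned to the correct token at
    position i; labels[i] == IGNORE_INDEX marks a masked (prompt) position.
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        raise ValueError("all positions are masked")
    return sum(kept) / len(kept)


# Two masked prompt positions followed by two target positions
loss = masked_nll([0.9, 0.1, 0.5, 0.25], [IGNORE_INDEX, IGNORE_INDEX, 42, 43])
```

The probabilities at the two masked positions have no effect on the result, which is the point of masking the prompt.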
Backward Pass & Optimization
- Gradient Computation: Backpropagation
- LoRA Updates: Update only adapter parameters (LoRA mode)
- Full Updates: Update all parameters (full fine-tuning)
- Optimizer Step: Apply gradients with learning rate scheduling
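The practical difference between the LoRA and full fine-tuning modes is how many parameters receive updates. As a back-of-the-envelope sketch, LoRA replaces the update to a dense `d_out x d_in` weight with two low-rank factors, so the trainable count per adapted layer is `rank * (d_in + d_out)`. The layer sizes and rank below are illustrative assumptions, not values from the Voxtral configuration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def full_trainable_params(d_in, d_out):
    """Parameters in the dense weight matrix W (d_out x d_in), ignoring bias."""
    return d_in * d_out


# Illustrative sizes only; not taken from the Voxtral configuration
d_in, d_out, rank = 4096, 4096, 8
lora = lora_trainable_params(d_in, d_out, rank)
full = full_trainable_params(d_in, d_out)
ratio = lora / full
```

For these sizes the adapter trains well under 1% of the parameters of the dense layer, which is why LoRA mode needs far less optimizer memory.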
Training Outputs
Model Files
- model.safetensors: Model weights (safetensors format)
- config.json: Model configuration
- tokenizer.json: Tokenizer configuration
- generation_config.json: Generation parameters
Training Logs
- train_results.json: Final training metrics
- eval_results.json: Evaluation results
- training_config.json: Training hyperparameters
- trainer_state.json: Training state and checkpoints
Checkpoints
- checkpoint-XXX/: Intermediate model snapshots
- best-model/: Best performing model
- final-model/: Final trained model
Publishing Pipeline
HF Repository Structure
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
Model Card Generation
- Template Processing: Fill model_card.md template
- Variable Injection: Training config, results, metadata
- Conditional Sections: Handle quantized models, etc.
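Template filling with a conditional section can be sketched with the standard library's `string.Template`. The template text, variable names, and `render_model_card` function below are hypothetical; the real `model_card.md` template and its templating mechanism are not shown in this document.

```python
from string import Template

# Hypothetical template text; the real model_card.md template is not shown here
CARD_TEMPLATE = Template(
    "# $model_name\n\n"
    "Base model: $base_model\n"
    "Final training loss: $train_loss\n"
    "$quantization_section"
)


def render_model_card(variables, quantized=False):
    """Fill the template; include the quantization section only when it applies."""
    section = "\n## Quantization\nThis is a quantized variant.\n" if quantized else ""
    return CARD_TEMPLATE.substitute(quantization_section=section, **variables)


# Placeholder values; real ones would come from the training config and results
card_variables = {
    "model_name": "username/model-name",
    "base_model": "<base model id>",
    "train_loss": "<loss>",
}
card = render_model_card(card_variables)
```

`Template.substitute` raises `KeyError` when a variable is missing, which surfaces incomplete metadata before the card is uploaded.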
Demo Deployment
Space Repository Structure
username/model-name-demo/
├── app.py # Gradio demo application
├── requirements.txt # Python dependencies
├── README.md # Space documentation
└── .env # Environment variables
Environment Configuration
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME=Voxtral Fine-tuned Model
HF_TOKEN=read_only_token # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
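On the demo side, these variables would typically be read at startup. The sketch below shows one way to do that with `os.environ`. The `load_demo_config` function, the returned keys, and the fallback of `MODEL_NAME` to the model ID are assumptions, not the demo app's actual code.

```python
import os


def load_demo_config(env=None):
    """Read the Space settings named above from an environment mapping."""
    env = os.environ if env is None else env
    model_id = env.get("HF_MODEL_ID")
    if not model_id:
        raise RuntimeError("HF_MODEL_ID must be set")
    return {
        "model_id": model_id,
        "model_name": env.get("MODEL_NAME", model_id),  # fallback is an assumption
        "token": env.get("HF_TOKEN"),  # None means anonymous access
    }


# Passing a plain dict instead of os.environ keeps the example self-contained
config = load_demo_config({"HF_MODEL_ID": "username/model-name"})
```

Accepting the mapping as a parameter makes the function easy to test without touching the real process environment.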
Data Flow Patterns
Streaming vs Batch Processing
- Training Data: Batch processing for efficiency
- External Datasets: Streaming loading for memory efficiency
- User Input: Real-time processing with immediate feedback
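The contrast between streaming and batch processing can be shown with a generator that reads records lazily and groups them into batches. This is a standard-library sketch with a hypothetical `stream_batches` helper. For Hub datasets, `datasets.load_dataset` accepts `streaming=True` for a similar effect, though whether the application uses it is not stated here.

```python
import io
import json
from itertools import islice


def stream_batches(lines, batch_size):
    """Yield lists of parsed JSONL records without reading the whole source first."""
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# Any iterable of lines works: an open file, a network stream, or this in-memory stand-in
source = io.StringIO("".join(json.dumps({"text": f"utt {i}"}) + "\n" for i in range(5)))
batch_sizes = [len(batch) for batch in stream_batches(source, batch_size=2)]
```

Only one batch is held in memory at a time, so peak memory depends on `batch_size` rather than on the size of the dataset.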
Data Validation
- Input Validation: Check audio format, sampling rate, text length
- Quality Assurance: Filter out empty or invalid entries
- Consistency Checks: Ensure audio-text alignment
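The input validation and filtering checks listed above might look roughly like this. The `validate_entry` helper, the accepted extensions (taken from the WAV/FLAC upload formats), and the length limit are illustrative assumptions rather than the application's actual rules.

```python
def validate_entry(entry, max_text_chars=1000):
    """Return a list of problems for one record; an empty list means it passed.

    max_text_chars is an arbitrary illustrative limit.
    """
    problems = []
    audio_path = entry.get("audio_path", "")
    text = entry.get("text", "")
    if not audio_path.lower().endswith((".wav", ".flac")):
        problems.append("unsupported audio format")
    if not text.strip():
        problems.append("empty transcription")
    elif len(text) > max_text_chars:
        problems.append("transcription too long")
    return problems


entries = [
    {"audio_path": "wavs/recording_0000.wav", "text": "hello world"},
    {"audio_path": "wavs/recording_0001.mp3", "text": "  "},
]
# Keep only the records that pass every check
valid = [e for e in entries if not validate_entry(e)]
```

Returning the list of problems rather than a boolean makes it straightforward to log why an entry was dropped.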
Error Handling
- Graceful Degradation: Fallback to local data if external sources fail
- Retry Logic: Automatic retry for network failures
- Logging: Comprehensive error logging and debugging
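Retry logic and graceful degradation combine naturally: retry the network call a few times, then fall back to local data. The sketch below assumes exponential backoff and treats `ConnectionError` as the retryable failure. The `with_retry` helper, the attempt count, and the delays are hypothetical choices.

```python
import time


def with_retry(fetch, fallback, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); retry on ConnectionError with exponential backoff, then fall back."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            # Wait longer after each failure, but not after the last one
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()


calls = []


def flaky_fetch():
    """Stand-in for a network call that always fails."""
    calls.append(1)
    raise ConnectionError("simulated outage")


# sleep is replaced with a no-op so the example runs instantly
result = with_retry(flaky_fetch, fallback=lambda: "local data", sleep=lambda s: None)
```

Injecting `sleep` keeps the backoff testable, and catching only `ConnectionError` lets unrelated bugs surface instead of being silently retried.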
Performance Considerations
Memory Management
- Streaming Loading: Process large datasets without loading everything
- Audio Caching: Cache processed audio features
- Batch Optimization: Balance batch size with available memory
Storage Optimization
- Compression: Use efficient audio formats
- Deduplication: Avoid duplicate data entries
- Cleanup: Remove temporary files after processing
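Deduplication is often done by hashing file contents and keeping the first file seen for each digest. The sketch below assumes exact byte-level duplicates are the target; the `dedupe_by_content` helper is hypothetical, and it would not catch re-encoded copies of the same recording.

```python
import hashlib


def dedupe_by_content(items):
    """Keep the first occurrence of each distinct payload, identified by SHA-256."""
    seen = set()
    unique = []
    for name, payload in items:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(name)
    return unique


# Payloads stand in for raw audio bytes read from disk
kept = dedupe_by_content([
    ("recording_0000.wav", b"\x00\x01"),
    ("recording_0001.wav", b"\x00\x01"),  # same bytes as the first
    ("recording_0002.wav", b"\x02\x03"),
])
```

Storing digests instead of payloads keeps memory use small even when the audio files themselves are large.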
Network Efficiency
- Incremental Uploads: Upload files as they're ready
- Resume Capability: Resume interrupted uploads
- Caching: Cache frequently accessed data
Security & Privacy
Data Privacy
- Local Processing: Audio files processed locally when possible
- User Consent: Clear data usage policies
- Anonymization: Remove personally identifiable information
Access Control
- Token Management: Secure HF token storage
- Repository Permissions: Appropriate public/private settings
- Rate Limiting: Prevent abuse of demo interfaces
Monitoring & Analytics
Data Quality Metrics
- Audio Quality: Sampling rate, format validation
- Text Quality: Length, language detection, consistency
- Dataset Statistics: Size, distribution, coverage
Performance Metrics
- Processing Time: Data loading, preprocessing, training time
- Model Metrics: Loss, perplexity, WER (if available)
- Resource Usage: Memory, CPU/GPU utilization
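Of the model metrics listed, WER (word error rate) is the one most specific to ASR. It is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain dynamic-programming implementation; it splits on whitespace and applies no text normalization, which is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) against six reference words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

WER can exceed 1.0 when the hypothesis contains many insertions, so it is better read as an error ratio than as a percentage bounded by 100.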
User Analytics
- Usage Patterns: Popular languages, dataset sizes
- Success Rates: Training completion, deployment success
- Error Patterns: Common failure modes and solutions