🎵 Music Genre MoE – Mixture of Experts for Genre Identification (Work in Progress)
A PyTorch implementation of a Mixture of Experts (MoE) architecture for music genre classification. Instead of one monolithic classifier, the model maintains five specialised genre expert networks (Jazz, Blues, Rock, Pop, and Classical), each trained to recognise the distinctive acoustic fingerprint of its genre. A learned gating network dynamically routes each audio clip to the right combination of experts, enabling soft, interpretable predictions across genre boundaries.
Architecture Overview
Why MoE for music?
Genres are not hard categories: a track can blend blues and rock, or jazz and classical. A hard classifier forces a single label. The MoE approach produces soft routing weights that naturally represent these overlapping memberships, and the per-expert activations are directly human-readable as confidence scores.
Key Components
GenreExpert
A small residual MLP with two skip-connection blocks. Each expert is independently initialised and learns to respond strongly to its own genre's acoustic patterns while ignoring others. Uses LayerNorm and GELU activations throughout.
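A minimal sketch of how such an expert could be implemented (dimensions follow the Model Summary table below; the dropout rate and exact block layout are assumptions, not the notebook's verbatim code):

```python
import torch
import torch.nn as nn

class GenreExpert(nn.Module):
    """Residual MLP expert: two LayerNorm/GELU blocks with skip connections."""

    def __init__(self, in_dim=128, hidden_dim=64, out_dim=32, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)  # shared latent -> expert width
        self.block1 = nn.Sequential(
            nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(), nn.Dropout(dropout),
        )
        self.block2 = nn.Sequential(
            nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(), nn.Dropout(dropout),
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.proj(x)
        h = h + self.block1(h)  # first skip-connection block
        h = h + self.block2(h)  # second skip-connection block
        return self.out(h)
```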
GatingNetwork
A 3-layer MLP that takes the shared 128-dim representation and outputs softmax weights over the five experts. During training, Gaussian noise is injected into the gate logits to encourage exploration of all experts and prevent routing collapse.
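A sketch under the same assumptions (layer sizes taken from the Model Summary table; the noise scale `noise_std` is illustrative):

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """3-layer MLP: shared 128-dim representation -> softmax weights over the experts."""

    def __init__(self, in_dim=128, n_experts=5, noise_std=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.GELU(), nn.LayerNorm(64),
            nn.Linear(64, 32), nn.GELU(),
            nn.Linear(32, n_experts),
        )
        self.noise_std = noise_std

    def forward(self, x):
        logits = self.net(x)
        if self.training:
            # Gaussian noise on the gate logits encourages exploration of all experts
            logits = logits + torch.randn_like(logits) * self.noise_std
        return torch.softmax(logits, dim=-1)
```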
MusicGenreMoE
The full model pipeline (sketched in code below):
- Shared encoder – two-layer MLP projects raw features to a 128-dim latent space shared by all experts
- Expert forward passes – all five experts process the shared representation in parallel
- Gating – soft weights computed from the shared representation
- Weighted fusion – `Σ wᵢ · expertᵢ(x)` merges expert outputs proportionally
- Classifier head – two-layer MLP maps the fused representation to genre logits
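A sketch of that forward pass, reusing the `GenreExpert` and `GatingNetwork` sketches above (dimensions from the Model Summary table; the return signature is an assumption):

```python
import torch
import torch.nn as nn

class MusicGenreMoE(nn.Module):
    def __init__(self, in_dim=59, latent_dim=128, expert_out=32, n_experts=5, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(), nn.Dropout(0.3),
            nn.Linear(latent_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(),
        )
        self.experts = nn.ModuleList(GenreExpert(latent_dim, 64, expert_out) for _ in range(n_experts))
        self.gate = GatingNetwork(latent_dim, n_experts)
        self.head = nn.Sequential(
            nn.Linear(expert_out, 64), nn.GELU(), nn.Dropout(0.15), nn.Linear(64, n_classes),
        )

    def forward(self, x):
        z = self.encoder(x)                                              # (B, 128) shared latent
        expert_outs = torch.stack([e(z) for e in self.experts], dim=1)   # (B, E, 32)
        weights = self.gate(z)                                           # (B, E) soft routing weights
        fused = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)         # Σ wᵢ · expertᵢ(z)
        return self.head(fused), weights                                 # genre logits + gate weights
```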
MoELoss
A composite training objective:
| Term | Purpose |
|---|---|
| Cross-entropy (label smoothing 0.1) | Primary classification signal |
| Load-balance loss (λ=0.05) | Prevents one expert from handling everything – penalises routing imbalance (Switch Transformer formulation) |
| Diversity / entropy loss (λ=0.01) | Maximises entropy of gate weights; encourages the gating network to use all experts |
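A sketch of how these three terms might be combined. The load-balance term follows the Switch Transformer recipe (mean gate probability × argmax dispatch fraction); treat the exact formulation as an assumption about the notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoss(nn.Module):
    """Cross-entropy + load-balance + gate-entropy composite objective (sketch)."""

    def __init__(self, n_experts=5, lambda_lb=0.05, lambda_div=0.01):
        super().__init__()
        self.n_experts = n_experts
        self.lambda_lb = lambda_lb
        self.lambda_div = lambda_div

    def forward(self, logits, targets, gate_weights):
        ce = F.cross_entropy(logits, targets, label_smoothing=0.1)

        # Switch-Transformer-style balance: mean gate probability per expert times
        # the fraction of samples whose argmax route is that expert.
        importance = gate_weights.mean(dim=0)                                              # (E,)
        dispatch = F.one_hot(gate_weights.argmax(dim=1), self.n_experts).float().mean(dim=0)
        load_balance = self.n_experts * (importance * dispatch).sum()

        # Diversity term: reward high entropy of the gate distribution.
        entropy = -(gate_weights * (gate_weights + 1e-8).log()).sum(dim=1).mean()

        return ce + self.lambda_lb * load_balance - self.lambda_div * entropy
```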
Feature Extraction
Every audio clip is mapped to a fixed 59-dimensional feature vector by librosa:
| Feature Group | Dims | Captures |
|---|---|---|
| MFCC mean | 20 | Timbre, spectral envelope |
| MFCC std | 20 | Timbre variation / texture |
| Chroma mean | 12 | Harmonic / tonal content |
| Spectral centroid | 1 | Brightness |
| Spectral bandwidth | 1 | Frequency spread |
| Spectral rolloff | 1 | High-frequency energy boundary |
| Zero crossing rate | 1 | Noisiness / percussiveness |
| RMS energy | 1 | Loudness / dynamics |
| Tempo | 1 | BPM estimate |
| Harmonic ratio | 1 | Tonal vs. noise content |
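A rough sketch of how this table could be computed with librosa. The notebook's `extract_features()` may use different hop lengths and statistics, and the harmonic-ratio definition below is an assumption:

```python
import numpy as np
import librosa

def extract_features_sketch(path, sr=22050, duration=10.0):
    """59-dim vector: MFCC mean/std, chroma mean, spectral stats, tempo, harmonic ratio."""
    y, sr = librosa.load(path, sr=sr, duration=duration, mono=True)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)
    tempo = float(librosa.beat.tempo(y=y, sr=sr)[0])          # BPM; librosa.feature.rhythm.tempo in newer releases

    y_harm, _ = librosa.effects.hpss(y)                       # harmonic / percussive split
    harmonic_ratio = float(np.sum(y_harm ** 2) / (np.sum(y ** 2) + 1e-8))

    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1), chroma.mean(axis=1),
        [centroid.mean(), bandwidth.mean(), rolloff.mean(),
         zcr.mean(), rms.mean(), tempo, harmonic_ratio],
    ]).astype(np.float32)                                     # 20 + 20 + 12 + 7 = 59 dims
```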
Synthetic Audio Generators
No external dataset is required to get started. Each genre generator produces audio with stylised, genre-typical acoustic signatures (a toy example follows the table below):
| Genre | Synthesis characteristics |
|---|---|
| Jazz | Swing timing (long-short 8th notes), chromatic passing tones, 7th and 9th harmonics |
| Blues | 12-bar-style phrasing, pentatonic scale, pitch bends, shuffle hi-hat pattern |
| Rock | Distorted odd harmonics (tanh clipping), power chord intervals, kick/snare on alternating beats |
| Pop | Clean major-scale melody, verse-chorus structure, four-on-the-floor kick, compressed dynamics |
| Classical | Rich harmonic series (many partials), string vibrato, wide dynamic crescendo, no percussion |
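As a flavour of the synthesis style, a stripped-down rock generator could combine a tanh-clipped power chord with an alternating kick/snare pattern. This is a toy illustration, not the notebook's actual generator:

```python
import numpy as np

def make_rock_sketch(duration=3.0, sr=22050, root_hz=110.0, bpm=120):
    """Toy rock clip: tanh-distorted power chord plus alternating kick/snare."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)

    # Power chord = root + perfect fifth; tanh clipping adds the odd harmonics of distortion
    chord = np.sin(2 * np.pi * root_hz * t) + np.sin(2 * np.pi * root_hz * 1.5 * t)
    guitar = np.tanh(4.0 * chord)

    # Kick on beats 1/3, snare (noise burst) on beats 2/4
    drums = np.zeros_like(t)
    beat = 60.0 / bpm
    for i, start in enumerate(np.arange(0.0, duration, beat)):
        seg = slice(int(start * sr), int(min(start + 0.1, duration) * sr))
        n = seg.stop - seg.start
        env = np.exp(-30.0 * np.linspace(0.0, 0.1, n))
        if i % 2 == 0:
            drums[seg] += 0.8 * np.sin(2 * np.pi * 60.0 * np.linspace(0.0, 0.1, n)) * env  # kick
        else:
            drums[seg] += 0.5 * np.random.randn(n) * env                                   # snare

    audio = 0.6 * guitar + drums
    return (audio / np.max(np.abs(audio))).astype(np.float32)
```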
Notebook Contents
The notebook `music_genre_moe.ipynb` runs end-to-end in 17 sections:
- Imports & Setup – packages, seeds, constants, device detection
- Synthetic Audio Generators – procedural genre synthesis
- Waveform & Spectrogram Visualisation – see each genre's acoustic signature
- Feature Extraction – `extract_features()` function with all 59 dimensions
- Dataset Generation – 900 samples (180 per genre × 5), z-score normalisation
- PyTorch Dataset & DataLoaders – stratified train / val / test split (120 / 30 / 30 per genre)
- Model Architecture – `GenreExpert`, `GatingNetwork`, `MusicGenreMoE`
- Loss Functions – `MoELoss` with load-balance and diversity terms
- Training Loop – AdamW, cosine annealing LR, gradient clipping, early stopping
- Training Curves – loss and accuracy plots saved to `training_curves.png`
- Confusion Matrix & Classification Report – per-genre precision, recall, F1
- Expert Gate Weight Analysis – heatmap of which experts activate for which genres
- ROC Curves – one-vs-rest AUC per genre
- Expert Activation Radar Charts – polar plots of routing profile per genre
- Feature Importance – input gradient analysis, top-20 features and group importance
- Predict Your Own Songs – `predict_audio_file()` for `.mp3` / `.wav` files
- Save & Load – checkpoint with model weights + normalisation stats
Quickstart
1. Install dependencies
```bash
pip install torch torchaudio librosa numpy scikit-learn matplotlib seaborn soundfile nbformat
```
2. Open the notebook
```bash
jupyter notebook music_genre_moe.ipynb
```
Then Run All Cells. The full pipeline (generation, training, evaluation) completes in a few minutes on CPU and in under a minute on GPU.
3. Predict a song of your own
```python
probs, weights, pred = predict_audio_file(
    'my_song.mp3',
    model,
    feat_mean=FEAT_MEAN,
    feat_std=FEAT_STD,
)
pretty_predict('My Song', probs, weights, pred)
```
Output:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎵 My Song
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Predicted genre : JAZZ (74% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Genre probabilities:
jazz       [███████████████░░░░░] 74%
blues      [███░░░░░░░░░░░░░░░░░] 14%
classical  [██░░░░░░░░░░░░░░░░░░]  8%
rock       [█░░░░░░░░░░░░░░░░░░░]  3%
pop        [░░░░░░░░░░░░░░░░░░░░]  1%
Expert gate weights:
jazz       [▪▪▪▪▪▪▪▪▪▪▪▪▪▪······] 0.71
blues      [▪▪▪▪················] 0.19
classical  [▪···················] 0.06
rock       [▪···················] 0.03
pop        [····················] 0.01
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
4. Load a saved checkpoint
```python
loaded_model, loaded_mean, loaded_std, genres = load_moe_model('moe_genre_checkpoint.pt')
```
Outputs
Running the notebook produces the following files:
| File | Contents |
|---|---|
| `moe_genre_model.pt` | Best model state dict (saved during training) |
| `moe_genre_checkpoint.pt` | Full checkpoint: weights + normalisation stats + genre list |
| `genre_spectrograms.png` | Waveforms and Mel spectrograms for all 5 genres |
| `training_curves.png` | Loss and accuracy over training epochs |
| `confusion_matrix.png` | Test set confusion matrix |
| `expert_gates.png` | Gate weight heatmap + expert self-activation bar chart |
| `roc_curves.png` | Per-genre ROC curves with AUC scores |
| `radar_charts.png` | Polar plots of expert routing profiles per genre |
| `feature_importance.png` | Top-20 features and group-level importance by gradient analysis |
Publishing to Hugging Face Hub
This repository includes three files that make the model fully compatible with the Hugging Face ecosystem.
| File | Purpose |
|---|---|
| `modeling.py` | Full model class definitions with `PreTrainedModel` and `PretrainedConfig` base classes |
| `config.json` | Model configuration with `auto_map` pointing to the custom classes |
| `upload_to_hub.py` | End-to-end export and push script |
Step 1 – Install Hub dependencies
```bash
pip install huggingface_hub transformers safetensors
```
Step 2 – Log in
```bash
huggingface-cli login
```
Or set the HF_TOKEN environment variable with a token from huggingface.co/settings/tokens.
Step 3 – Run the upload script
```bash
python upload_to_hub.py --repo your-username/music-genre-moe
```
The script:
- Loads `moe_genre_checkpoint.pt` (saved by the notebook)
- Bakes the real `feat_mean` / `feat_std` normalisation stats into `config.json`
- Saves weights as `model.safetensors` (or `pytorch_model.bin` if safetensors is unavailable)
- Copies `modeling.py`, the notebook as `walkthrough.ipynb`, and the README
- Creates the Hub repo and uploads everything in one commit
Optional flags:
```bash
python upload_to_hub.py \
    --repo your-username/music-genre-moe \
    --checkpoint moe_genre_checkpoint.pt \
    --private \
    --token hf_xxxx...
```
Loading from the Hub
Once uploaded, anyone can load and use the model in two lines:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/music-genre-moe",
    trust_remote_code=True,
)
model.eval()
```
Running inference on a feature vector:
```python
import torch

# features: (batch, 59) tensor – already z-score normalised
features = torch.randn(1, 59)
result = model.predict(features)
# {
#   "predicted_genre": "jazz",
#   "confidence": 0.7412,
#   "probabilities": {"jazz": 0.7412, "blues": 0.1803, ...},
#   "gate_weights": {"jazz": 0.7105, "blues": 0.1920, ...},
# }
print(result)
```
If you have raw (unnormalised) features, pass already_normalised=False and the model will apply the stored stats automatically:
```python
result = model.predict(raw_features, already_normalised=False)
```
What gets published
```
your-username/music-genre-moe/
├── config.json        – architecture + normalisation stats + auto_map
├── model.safetensors  – trained weights
├── modeling.py        – custom model classes (loaded via trust_remote_code)
├── walkthrough.ipynb  – full training notebook (rendered on Hub, Open in Colab button)
└── README.md          – this file
```
The notebook renders directly on the Hub model page, and visitors get a one-click Open in Colab button to run the full training pipeline without downloading anything locally.
Extending the Model
Add more genres
Define a new generator function following the pattern of make_jazz, make_blues, etc., and add it to GENRE_GENERATORS. The MusicGenreMoE class is parameterised by n_experts and n_classes β just increase both and retrain.
Use a real dataset
The GTZAN Genre Collection provides 1,000 30-second clips across 10 genres. Load each with librosa.load() and pass through extract_features(). The rest of the pipeline is unchanged.
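A hedged sketch of wiring GTZAN into the same pipeline. It assumes the common `root/<genre>/*.wav` layout (the original distribution ships `.au` files, which librosa can also read) and reuses the feature extractor described above:

```python
from pathlib import Path
import numpy as np

GTZAN_GENRES = ["blues", "classical", "country", "disco", "hiphop",
                "jazz", "metal", "pop", "reggae", "rock"]

def build_gtzan_dataset(root="gtzan", ext="*.wav"):
    """Build (features, labels) arrays from a GTZAN-style directory tree."""
    X, y = [], []
    for label, genre in enumerate(GTZAN_GENRES):
        for clip in sorted(Path(root, genre).glob(ext)):
            X.append(extract_features(str(clip)))   # same 59-dim vector as the synthetic data
            y.append(label)
    return np.stack(X), np.array(y)
```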
Sparse Top-K gating
For faster inference, activate only the top-2 experts per input instead of all five. Replace the weighted sum with a sparse Top-K routing step, similar to the Switch Transformer or GShard papers.
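A minimal sketch of what that routing step could look like, operating on the gate logits with `k=2` as suggested above:

```python
import torch

def topk_gate(gate_logits, k=2):
    """Keep the top-k gate logits per sample, softmax them, zero out the rest."""
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)
    weights = torch.zeros_like(gate_logits)
    weights.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return weights  # (B, E) with only k non-zero entries per row
```

With sparse weights, only the selected experts need a forward pass for each input.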
Contrastive pre-training
Before end-to-end training, pre-train each expert with a triplet loss using prototype songs as anchors. This gives each expert a strong genre-specific initialisation before the gating network is introduced.
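One way to sketch this with PyTorch's built-in triplet loss (the helper `pretrain_expert_step` is hypothetical; anchors would be feature vectors of prototype songs for the expert's genre):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def pretrain_expert_step(encoder, expert, anchor_x, positive_x, negative_x, optimiser):
    """One step: pull same-genre embeddings together, push other-genre embeddings apart."""
    optimiser.zero_grad()
    a = expert(encoder(anchor_x))    # prototype song of the expert's genre
    p = expert(encoder(positive_x))  # another clip of the same genre
    n = expert(encoder(negative_x))  # clip from a different genre
    loss = triplet(a, p, n)
    loss.backward()
    optimiser.step()
    return loss.item()
```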
Temporal modelling
Replace the shared MLP encoder with a 1D-CNN or a small LSTM operating over time-windowed feature frames. This captures rhythm evolution and structural patterns (verse-chorus transitions) that the current frame-averaged features miss.
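A sketch of such an encoder, assuming per-frame features of shape `(batch, n_features, time)` rather than the current clip-level averages:

```python
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """1D-CNN over feature frames; a drop-in replacement for the shared MLP encoder."""

    def __init__(self, n_features=32, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(64, latent_dim, kernel_size=5, padding=2), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> one latent vector per clip
        )

    def forward(self, frames):                  # frames: (B, n_features, T)
        return self.conv(frames).squeeze(-1)    # (B, latent_dim)
```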
Confidence calibration
Add temperature scaling on the classifier head logits to produce better-calibrated probability estimates, useful when using the output probabilities for downstream decisions.
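A standard post-hoc recipe, sketched below; the temperature should be fit on held-out validation logits, not the training set:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaler(nn.Module):
    """Divide logits by a single learned temperature to calibrate probabilities."""

    def __init__(self):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(1))  # T = exp(log_t), starts at 1.0

    def forward(self, logits):
        return logits / self.log_t.exp()

def fit_temperature(scaler, val_logits, val_labels):
    """Optimise the temperature by minimising NLL on detached validation logits."""
    opt = torch.optim.LBFGS([scaler.log_t], lr=0.01, max_iter=200)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(scaler(val_logits), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return scaler.log_t.exp().item()
```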
Model Summary
| Component | Configuration |
|---|---|
| Input dimension | 59 |
| Shared encoder | Linear(59→128) → LayerNorm → GELU → Dropout(0.3) → Linear(128→128) → LayerNorm → GELU |
| Expert hidden dim | 64 |
| Expert output dim | 32 |
| Number of experts | 5 (one per genre) |
| Gating network | Linear(128→64) → GELU → LayerNorm → Linear(64→32) → GELU → Linear(32→5) → Softmax |
| Classifier head | Linear(32→64) → GELU → Dropout(0.15) → Linear(64→5) |
| Total parameters | ~175,000 |
| Optimiser | AdamW (lr=3e-3, weight_decay=1e-4) |
| LR schedule | Cosine annealing (T_max=120, η_min=1e-5) |
| Training epochs | Up to 120 with early stopping (patience=20) |
| Batch size | 64 |
Evaluation Summary – Synthetic Data
⚠️ All results below are on synthetically generated audio only. Real-world audio evaluation is pending.
The model was trained and evaluated entirely on procedurally generated signals, with one deterministic generator per genre (Jazz, Blues, Rock, Pop, Classical). Under these conditions the model achieves perfect scores across every metric, but this should be interpreted carefully.
Results on synthetic data
| Metric | Value |
|---|---|
| Test Accuracy | 100% |
| Macro F1 | 1.00 |
| ROC AUC (all genres) | 1.000 |
These numbers reflect the model learning to distinguish five mathematical signal generators, not five real-world musical genres. The synthetic generators are deterministic enough that the classifier only needs to recognise the acoustic fingerprint of each function, a considerably simpler task than generalising to real recordings.
Training behaviour
Convergence was rapid and stable. Both train and validation loss tracked in lockstep with no divergence, and accuracy reached ~100% by epoch 3. There is no evidence of overfitting (the train/val gap never opened), but the corollary is that the task offered no real opportunity for it to emerge. The evaluation is not a meaningful test of generalisation.
Expert specialisation
The most informative result is the gate weight analysis. Despite perfect classification accuracy, the gating network did not develop the intended per-genre expert routing. Self-activation weights (the fraction of weight each genre's own expert receives) were low across the board:
| Genre | Self-activation |
|---|---|
| Jazz | 8% |
| Blues | 17% |
| Rock | 10% |
| Pop | 1% |
| Classical | 0% |
The model learned to classify correctly by routing most inputs through one or two dominant experts rather than distributing responsibility across all five. This is a known failure mode in MoE training called expert collapse, and it is the primary issue to address before evaluating on real audio. Increasing the load-balance loss penalty (`lambda_lb`) and optionally pre-training each expert on its own genre in isolation are the recommended next steps.
What comes next
Real-audio evaluation on a dataset such as GTZAN is required before any of the above metrics can be considered meaningful. On real recordings the expectation is:
- Accuracy drops to a more honest 70–90% range
- A genuine train/val gap may emerge, making regularisation decisions consequential
- Expert specialisation becomes harder to achieve and more valuable to diagnose
- Cross-genre tracks (blues-rock, jazz-classical) will stress-test the soft routing mechanism in ways synthetic data cannot
The architecture and training pipeline are validated end-to-end. The synthetic results confirm the implementation is correct. Real-audio benchmarking is the critical next step.
Requirements
```
torch >= 2.0
torchaudio >= 2.0
librosa >= 0.10
numpy
scikit-learn
matplotlib
seaborn
soundfile
nbformat
```
Python 3.9 or later recommended.
License
MIT – free to use, modify, and extend.