🎵 Music Genre MoE – Mixture of Experts for Genre Identification (Work in Progress)
A PyTorch implementation of a Mixture of Experts (MoE) architecture for music genre classification. Instead of one monolithic classifier, the model maintains five specialised genre expert networks (Jazz, Blues, Rock, Pop, and Classical), each trained to recognise the distinctive acoustic fingerprint of its genre. A learned gating network dynamically routes each audio clip to the right combination of experts, enabling soft, interpretable predictions across genre boundaries.
Architecture Overview
Why MoE for music?
Genres are not hard categories: a track can blend blues and rock, or jazz and classical. A hard classifier forces a single label. The MoE approach produces soft routing weights that naturally represent these overlapping memberships, and the per-expert activations are directly human-readable as confidence scores.
Key Components
GenreExpert
A small residual MLP with two skip-connection blocks. Each expert is independently initialised and learns to respond strongly to its own genre's acoustic patterns while ignoring others. Uses LayerNorm and GELU activations throughout.
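A minimal sketch of how such an expert could be implemented (dimensions follow the Model Summary table below; the dropout rate and exact block layout are assumptions, not the notebook's verbatim code):

```python
import torch
import torch.nn as nn

class GenreExpert(nn.Module):
    """Residual MLP expert: two LayerNorm/GELU blocks with skip connections."""

    def __init__(self, in_dim=128, hidden_dim=64, out_dim=32, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)  # shared latent -> expert width
        self.block1 = nn.Sequential(
            nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(), nn.Dropout(dropout),
        )
        self.block2 = nn.Sequential(
            nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(), nn.Dropout(dropout),
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.proj(x)
        h = h + self.block1(h)  # first skip-connection block
        h = h + self.block2(h)  # second skip-connection block
        return self.out(h)
```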
GatingNetwork
A 3-layer MLP that takes the shared 128-dim representation and outputs softmax weights over the five experts. During training, Gaussian noise is injected into the gate logits to encourage exploration of all experts and prevent routing collapse.
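A sketch under the same assumptions (layer sizes taken from the Model Summary table; the noise scale `noise_std` is illustrative):

```python
import torch
import torch.nn as nn

class GatingNetwork(nn.Module):
    """3-layer MLP: shared 128-dim representation -> softmax weights over the experts."""

    def __init__(self, in_dim=128, n_experts=5, noise_std=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.GELU(), nn.LayerNorm(64),
            nn.Linear(64, 32), nn.GELU(),
            nn.Linear(32, n_experts),
        )
        self.noise_std = noise_std

    def forward(self, x):
        logits = self.net(x)
        if self.training:
            # Gaussian noise on the gate logits encourages exploration of all experts
            logits = logits + torch.randn_like(logits) * self.noise_std
        return torch.softmax(logits, dim=-1)
```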
MusicGenreMoE
The full model pipeline (sketched in code below):
- Shared encoder – two-layer MLP projects raw features to a 128-dim latent space shared by all experts
- Expert forward passes – all five experts process the shared representation in parallel
- Gating – soft weights computed from the shared representation
- Weighted fusion – `Σ wᵢ · expertᵢ(x)` merges expert outputs proportionally
- Classifier head – two-layer MLP maps the fused representation to genre logits
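A sketch of that forward pass, reusing the `GenreExpert` and `GatingNetwork` sketches above (dimensions from the Model Summary table; the return signature is an assumption):

```python
import torch
import torch.nn as nn

class MusicGenreMoE(nn.Module):
    def __init__(self, in_dim=59, latent_dim=128, expert_out=32, n_experts=5, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(), nn.Dropout(0.3),
            nn.Linear(latent_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(),
        )
        self.experts = nn.ModuleList(GenreExpert(latent_dim, 64, expert_out) for _ in range(n_experts))
        self.gate = GatingNetwork(latent_dim, n_experts)
        self.head = nn.Sequential(
            nn.Linear(expert_out, 64), nn.GELU(), nn.Dropout(0.15), nn.Linear(64, n_classes),
        )

    def forward(self, x):
        z = self.encoder(x)                                              # (B, 128) shared latent
        expert_outs = torch.stack([e(z) for e in self.experts], dim=1)   # (B, E, 32)
        weights = self.gate(z)                                           # (B, E) soft routing weights
        fused = (weights.unsqueeze(-1) * expert_outs).sum(dim=1)         # Σ wᵢ · expertᵢ(z)
        return self.head(fused), weights                                 # genre logits + gate weights
```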
MoELoss
A composite training objective:
| Term | Purpose |
|---|---|
| Cross-entropy (label smoothing 0.1) | Primary classification signal |
| Load-balance loss (λ=0.05) | Prevents one expert from handling everything – penalises routing imbalance (Switch Transformer formulation) |
| Diversity / entropy loss (λ=0.01) | Maximises entropy of gate weights; encourages the gating network to use all experts |
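A sketch of how these three terms might be combined. The load-balance term follows the Switch Transformer recipe (mean gate probability × argmax dispatch fraction); treat the exact formulation as an assumption about the notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoss(nn.Module):
    """Cross-entropy + load-balance + gate-entropy composite objective (sketch)."""

    def __init__(self, n_experts=5, lambda_lb=0.05, lambda_div=0.01):
        super().__init__()
        self.n_experts = n_experts
        self.lambda_lb = lambda_lb
        self.lambda_div = lambda_div

    def forward(self, logits, targets, gate_weights):
        ce = F.cross_entropy(logits, targets, label_smoothing=0.1)

        # Switch-Transformer-style balance: mean gate probability per expert times
        # the fraction of samples whose argmax route is that expert.
        importance = gate_weights.mean(dim=0)                                              # (E,)
        dispatch = F.one_hot(gate_weights.argmax(dim=1), self.n_experts).float().mean(dim=0)
        load_balance = self.n_experts * (importance * dispatch).sum()

        # Diversity term: reward high entropy of the gate distribution.
        entropy = -(gate_weights * (gate_weights + 1e-8).log()).sum(dim=1).mean()

        return ce + self.lambda_lb * load_balance - self.lambda_div * entropy
```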
Feature Extraction
Every audio clip is mapped to a fixed 59-dimensional feature vector by librosa:
| Feature Group | Dims | Captures |
|---|---|---|
| MFCC mean | 20 | Timbre, spectral envelope |
| MFCC std | 20 | Timbre variation / texture |
| Chroma mean | 12 | Harmonic / tonal content |
| Spectral centroid | 1 | Brightness |
| Spectral bandwidth | 1 | Frequency spread |
| Spectral rolloff | 1 | High-frequency energy boundary |
| Zero crossing rate | 1 | Noisiness / percussiveness |
| RMS energy | 1 | Loudness / dynamics |
| Tempo | 1 | BPM estimate |
| Harmonic ratio | 1 | Tonal vs. noise content |
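A rough sketch of how this table could be computed with librosa. The notebook's `extract_features()` may use different hop lengths and statistics, and the harmonic-ratio definition below is an assumption:

```python
import numpy as np
import librosa

def extract_features_sketch(path, sr=22050, duration=10.0):
    """59-dim vector: MFCC mean/std, chroma mean, spectral stats, tempo, harmonic ratio."""
    y, sr = librosa.load(path, sr=sr, duration=duration, mono=True)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # (20, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # (12, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)
    tempo = float(librosa.beat.tempo(y=y, sr=sr)[0])          # BPM; librosa.feature.rhythm.tempo in newer releases

    y_harm, _ = librosa.effects.hpss(y)                       # harmonic / percussive split
    harmonic_ratio = float(np.sum(y_harm ** 2) / (np.sum(y ** 2) + 1e-8))

    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1), chroma.mean(axis=1),
        [centroid.mean(), bandwidth.mean(), rolloff.mean(),
         zcr.mean(), rms.mean(), tempo, harmonic_ratio],
    ]).astype(np.float32)                                     # 20 + 20 + 12 + 7 = 59 dims
```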
Synthetic Audio Generators
No external dataset is required to get started. Each genre generator produces audio with stylised, genre-typical acoustic signatures (a toy example follows the table below):
| Genre | Synthesis characteristics |
|---|---|
| Jazz | Swing timing (long-short 8th notes), chromatic passing tones, 7th and 9th harmonics |
| Blues | 12-bar-style phrasing, pentatonic scale, pitch bends, shuffle hi-hat pattern |
| Rock | Distorted odd harmonics (tanh clipping), power chord intervals, kick/snare on alternating beats |
| Pop | Clean major-scale melody, verse-chorus structure, four-on-the-floor kick, compressed dynamics |
| Classical | Rich harmonic series (many partials), string vibrato, wide dynamic crescendo, no percussion |
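As a flavour of the synthesis style, a stripped-down rock generator could combine a tanh-clipped power chord with an alternating kick/snare pattern. This is a toy illustration, not the notebook's actual generator:

```python
import numpy as np

def make_rock_sketch(duration=3.0, sr=22050, root_hz=110.0, bpm=120):
    """Toy rock clip: tanh-distorted power chord plus alternating kick/snare."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)

    # Power chord = root + perfect fifth; tanh clipping adds the odd harmonics of distortion
    chord = np.sin(2 * np.pi * root_hz * t) + np.sin(2 * np.pi * root_hz * 1.5 * t)
    guitar = np.tanh(4.0 * chord)

    # Kick on beats 1/3, snare (noise burst) on beats 2/4
    drums = np.zeros_like(t)
    beat = 60.0 / bpm
    for i, start in enumerate(np.arange(0.0, duration, beat)):
        seg = slice(int(start * sr), int(min(start + 0.1, duration) * sr))
        n = seg.stop - seg.start
        env = np.exp(-30.0 * np.linspace(0.0, 0.1, n))
        if i % 2 == 0:
            drums[seg] += 0.8 * np.sin(2 * np.pi * 60.0 * np.linspace(0.0, 0.1, n)) * env  # kick
        else:
            drums[seg] += 0.5 * np.random.randn(n) * env                                   # snare

    audio = 0.6 * guitar + drums
    return (audio / np.max(np.abs(audio))).astype(np.float32)
```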
Notebook Contents
The notebook `music_genre_moe.ipynb` runs end-to-end in 17 sections:
- Imports & Setup – packages, seeds, constants, device detection
- Synthetic Audio Generators – procedural genre synthesis
- Waveform & Spectrogram Visualisation – see each genre's acoustic signature
- Feature Extraction – `extract_features()` function with all 59 dimensions
- Dataset Generation – 900 samples (180 per genre × 5), z-score normalisation
- PyTorch Dataset & DataLoaders – stratified train / val / test split (120 / 30 / 30 per genre)
- Model Architecture – `GenreExpert`, `GatingNetwork`, `MusicGenreMoE`
- Loss Functions – `MoELoss` with load-balance and diversity terms
- Training Loop – AdamW, cosine annealing LR, gradient clipping, early stopping
- Training Curves – loss and accuracy plots saved to `training_curves.png`
- Confusion Matrix & Classification Report – per-genre precision, recall, F1
- Expert Gate Weight Analysis – heatmap of which experts activate for which genres
- ROC Curves – one-vs-rest AUC per genre
- Expert Activation Radar Charts – polar plots of routing profile per genre
- Feature Importance – input gradient analysis, top-20 features and group importance
- Predict Your Own Songs – `predict_audio_file()` for `.mp3` / `.wav` files
- Save & Load – checkpoint with model weights + normalisation stats
Quickstart
1. Install dependencies
```bash
pip install torch torchaudio librosa numpy scikit-learn matplotlib seaborn soundfile nbformat
```
2. Open the notebook
```bash
jupyter notebook music_genre_moe.ipynb
```
Then Run All Cells. The full pipeline (generation, training, evaluation) completes in a few minutes on CPU and in under a minute on GPU.
3. Predict a song of your own
```python
probs, weights, pred = predict_audio_file(
    'my_song.mp3',
    model,
    feat_mean=FEAT_MEAN,
    feat_std=FEAT_STD,
)
pretty_predict('My Song', probs, weights, pred)
```
Output:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎵 My Song
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Predicted genre : JAZZ (74% confidence)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Genre probabilities:
jazz       [███████████████░░░░░] 74%
blues      [███░░░░░░░░░░░░░░░░░] 14%
classical  [██░░░░░░░░░░░░░░░░░░]  8%
rock       [█░░░░░░░░░░░░░░░░░░░]  3%
pop        [░░░░░░░░░░░░░░░░░░░░]  1%
Expert gate weights:
jazz       [▪▪▪▪▪▪▪▪▪▪▪▪▪▪······] 0.71
blues      [▪▪▪▪················] 0.19
classical  [▪···················] 0.06
rock       [▪···················] 0.03
pop        [····················] 0.01
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
4. Load a saved checkpoint
```python
loaded_model, loaded_mean, loaded_std, genres = load_moe_model('moe_genre_checkpoint.pt')
```
Outputs
Running the notebook produces the following files:
| File | Contents |
|---|---|
| `moe_genre_model.pt` | Best model state dict (saved during training) |
| `moe_genre_checkpoint.pt` | Full checkpoint: weights + normalisation stats + genre list |
| `genre_spectrograms.png` | Waveforms and Mel spectrograms for all 5 genres |
| `training_curves.png` | Loss and accuracy over training epochs |
| `confusion_matrix.png` | Test set confusion matrix |
| `expert_gates.png` | Gate weight heatmap + expert self-activation bar chart |
| `roc_curves.png` | Per-genre ROC curves with AUC scores |
| `radar_charts.png` | Polar plots of expert routing profiles per genre |
| `feature_importance.png` | Top-20 features and group-level importance by gradient analysis |
Publishing to Hugging Face Hub
This repository includes three files that make the model fully compatible with the Hugging Face ecosystem.
| File | Purpose |
|---|---|
| `modeling.py` | Full model class definitions with `PreTrainedModel` and `PretrainedConfig` base classes |
| `config.json` | Model configuration with `auto_map` pointing to the custom classes |
| `upload_to_hub.py` | End-to-end export and push script |
Step 1 – Install Hub dependencies
```bash
pip install huggingface_hub transformers safetensors
```
Step 2 – Log in
```bash
huggingface-cli login
```
Or set the HF_TOKEN environment variable with a token from huggingface.co/settings/tokens.
Step 3 – Run the upload script
```bash
python upload_to_hub.py --repo your-username/music-genre-moe
```
The script:
- Loads `moe_genre_checkpoint.pt` (saved by the notebook)
- Bakes the real `feat_mean` / `feat_std` normalisation stats into `config.json`
- Saves weights as `model.safetensors` (or `pytorch_model.bin` if safetensors is unavailable)
- Copies `modeling.py`, the notebook as `walkthrough.ipynb`, and the README
- Creates the Hub repo and uploads everything in one commit
Optional flags:
```bash
python upload_to_hub.py \
    --repo your-username/music-genre-moe \
    --checkpoint moe_genre_checkpoint.pt \
    --private \
    --token hf_xxxx...
```
Loading from the Hub
Once uploaded, anyone can load and use the model in two lines:
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/music-genre-moe",
    trust_remote_code=True,
)
model.eval()
```
Running inference on a feature vector:
```python
import torch

# features: (batch, 59) tensor – already z-score normalised
features = torch.randn(1, 59)
result = model.predict(features)
# {
#   "predicted_genre": "jazz",
#   "confidence": 0.7412,
#   "probabilities": {"jazz": 0.7412, "blues": 0.1803, ...},
#   "gate_weights": {"jazz": 0.7105, "blues": 0.1920, ...},
# }
print(result)
```
If you have raw (unnormalised) features, pass already_normalised=False and the model will apply the stored stats automatically:
```python
result = model.predict(raw_features, already_normalised=False)
```
What gets published
```
your-username/music-genre-moe/
├── config.json        – architecture + normalisation stats + auto_map
├── model.safetensors  – trained weights
├── modeling.py        – custom model classes (loaded via trust_remote_code)
├── walkthrough.ipynb  – full training notebook (rendered on Hub, Open in Colab button)
└── README.md          – this file
```
The notebook renders directly on the Hub model page, and visitors get a one-click Open in Colab button to run the full training pipeline without downloading anything locally.
Extending the Model
Add more genres
Define a new generator function following the pattern of make_jazz, make_blues, etc., and add it to GENRE_GENERATORS. The MusicGenreMoE class is parameterised by n_experts and n_classes β just increase both and retrain.
Use a real dataset
The GTZAN Genre Collection provides 1,000 30-second clips across 10 genres. Load each with librosa.load() and pass through extract_features(). The rest of the pipeline is unchanged.
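A hedged sketch of wiring GTZAN into the same pipeline. It assumes the common `root/<genre>/*.wav` layout (the original distribution ships `.au` files, which librosa can also read) and reuses the feature extractor described above:

```python
from pathlib import Path
import numpy as np

GTZAN_GENRES = ["blues", "classical", "country", "disco", "hiphop",
                "jazz", "metal", "pop", "reggae", "rock"]

def build_gtzan_dataset(root="gtzan", ext="*.wav"):
    """Build (features, labels) arrays from a GTZAN-style directory tree."""
    X, y = [], []
    for label, genre in enumerate(GTZAN_GENRES):
        for clip in sorted(Path(root, genre).glob(ext)):
            X.append(extract_features(str(clip)))   # same 59-dim vector as the synthetic data
            y.append(label)
    return np.stack(X), np.array(y)
```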
Sparse Top-K gating
For faster inference, activate only the top-2 experts per input instead of all five. Replace the weighted sum with a sparse Top-K routing step, similar to the Switch Transformer or GShard papers.
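A minimal sketch of what that routing step could look like, operating on the gate logits with `k=2` as suggested above:

```python
import torch

def topk_gate(gate_logits, k=2):
    """Keep the top-k gate logits per sample, softmax them, zero out the rest."""
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)
    weights = torch.zeros_like(gate_logits)
    weights.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return weights  # (B, E) with only k non-zero entries per row
```

With sparse weights, only the selected experts need a forward pass for each input.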
Contrastive pre-training
Before end-to-end training, pre-train each expert with a triplet loss using prototype songs as anchors. This gives each expert a strong genre-specific initialisation before the gating network is introduced.
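One way to sketch this with PyTorch's built-in triplet loss (the helper `pretrain_expert_step` is hypothetical; anchors would be feature vectors of prototype songs for the expert's genre):

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)

def pretrain_expert_step(encoder, expert, anchor_x, positive_x, negative_x, optimiser):
    """One step: pull same-genre embeddings together, push other-genre embeddings apart."""
    optimiser.zero_grad()
    a = expert(encoder(anchor_x))    # prototype song of the expert's genre
    p = expert(encoder(positive_x))  # another clip of the same genre
    n = expert(encoder(negative_x))  # clip from a different genre
    loss = triplet(a, p, n)
    loss.backward()
    optimiser.step()
    return loss.item()
```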
Temporal modelling
Replace the shared MLP encoder with a 1D-CNN or a small LSTM operating over time-windowed feature frames. This captures rhythm evolution and structural patterns (verse-chorus transitions) that the current frame-averaged features miss.
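A sketch of such an encoder, assuming per-frame features of shape `(batch, n_features, time)` rather than the current clip-level averages:

```python
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """1D-CNN over feature frames; a drop-in replacement for the shared MLP encoder."""

    def __init__(self, n_features=32, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(64, latent_dim, kernel_size=5, padding=2), nn.GELU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> one latent vector per clip
        )

    def forward(self, frames):                  # frames: (B, n_features, T)
        return self.conv(frames).squeeze(-1)    # (B, latent_dim)
```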
Confidence calibration
Add temperature scaling on the classifier head logits to produce better-calibrated probability estimates, useful when using the output probabilities for downstream decisions.
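A standard post-hoc recipe, sketched below; the temperature should be fit on held-out validation logits, not the training set:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaler(nn.Module):
    """Divide logits by a single learned temperature to calibrate probabilities."""

    def __init__(self):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(1))  # T = exp(log_t), starts at 1.0

    def forward(self, logits):
        return logits / self.log_t.exp()

def fit_temperature(scaler, val_logits, val_labels):
    """Optimise the temperature by minimising NLL on detached validation logits."""
    opt = torch.optim.LBFGS([scaler.log_t], lr=0.01, max_iter=200)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(scaler(val_logits), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return scaler.log_t.exp().item()
```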
Model Summary
| Component | Configuration |
|---|---|
| Input dimension | 59 |
| Shared encoder | Linear(59→128) → LayerNorm → GELU → Dropout(0.3) → Linear(128→128) → LayerNorm → GELU |
| Expert hidden dim | 64 |
| Expert output dim | 32 |
| Number of experts | 5 (one per genre) |
| Gating network | Linear(128→64) → GELU → LayerNorm → Linear(64→32) → GELU → Linear(32→5) → Softmax |
| Classifier head | Linear(32→64) → GELU → Dropout(0.15) → Linear(64→5) |
| Total parameters | ~175,000 |
| Optimiser | AdamW (lr=3e-3, weight_decay=1e-4) |
| LR schedule | Cosine annealing (T_max=120, η_min=1e-5) |
| Training epochs | Up to 120 with early stopping (patience=20) |
| Batch size | 64 |
Evaluation Summary – Synthetic Data
⚠️ All results below are on synthetically generated audio only. Real-world audio evaluation is pending.
The model was trained and evaluated entirely on procedurally generated signals, with one deterministic generator per genre (Jazz, Blues, Rock, Pop, Classical). Under these conditions the model achieves perfect scores across every metric, but this should be interpreted carefully.
Results on synthetic data
| Metric | Value |
|---|---|
| Test Accuracy | 100% |
| Macro F1 | 1.00 |
| ROC AUC (all genres) | 1.000 |
These numbers reflect the model learning to distinguish five mathematical signal generators, not five real-world musical genres. The synthetic generators are deterministic enough that the classifier only needs to recognise the acoustic fingerprint of each function, a considerably simpler task than generalising to real recordings.
Training behaviour
Convergence was rapid and stable. Both train and validation loss tracked in lockstep with no divergence, and accuracy reached ~100% by epoch 3. There is no evidence of overfitting (the train/val gap never opened), but the corollary is that the task offered no real opportunity for it to emerge. The evaluation is not a meaningful test of generalisation.
Expert specialisation
The most informative result is the gate weight analysis. Despite perfect classification accuracy, the gating network did not develop the intended per-genre expert routing. Self-activation weights (the fraction of weight each genre's own expert receives) were low across the board:
| Genre | Self-activation |
|---|---|
| Jazz | 8% |
| Blues | 17% |
| Rock | 10% |
| Pop | 1% |
| Classical | 0% |
The model learned to classify correctly by routing most inputs through one or two dominant experts rather than distributing responsibility across all five. This is a known failure mode in MoE training called expert collapse, and it is the primary issue to address before evaluating on real audio. Increasing the load-balance loss penalty (`lambda_lb`) and optionally pre-training each expert on its own genre in isolation are the recommended next steps.
What comes next
Real-audio evaluation on a dataset such as GTZAN is required before any of the above metrics can be considered meaningful. On real recordings the expectation is:
- Accuracy drops to a more honest 70–90% range
- A genuine train/val gap may emerge, making regularisation decisions consequential
- Expert specialisation becomes harder to achieve and more valuable to diagnose
- Cross-genre tracks (blues-rock, jazz-classical) will stress-test the soft routing mechanism in ways synthetic data cannot
The architecture and training pipeline are validated end-to-end. The synthetic results confirm the implementation is correct. Real-audio benchmarking is the critical next step.
Requirements
```
torch >= 2.0
torchaudio >= 2.0
librosa >= 0.10
numpy
scikit-learn
matplotlib
seaborn
soundfile
nbformat
```
Python 3.9 or later recommended.
License
MIT – free to use, modify, and extend.