🎵 Music Genre MoE — Mixture of Experts for Genre Identification (Work in Progress)

A PyTorch implementation of a Mixture of Experts (MoE) architecture for music genre classification. Instead of one monolithic classifier, the model maintains five specialised genre expert networks — Jazz, Blues, Rock, Pop, and Classical — each trained to recognise the distinctive acoustic fingerprint of its genre. A learned gating network dynamically routes each audio clip to the right combination of experts, enabling soft, interpretable predictions across genre boundaries.


Architecture Overview

Why MoE for music?

Genres are not hard categories — a track can be simultaneously bluesy and rocky, or jazzy and classical. Hard classifiers force a single winner-take-all label. The MoE approach produces soft routing weights that naturally represent these overlapping memberships, and the per-expert activations are directly human-readable as confidence scores.


Key Components

GenreExpert

A small residual MLP with two skip-connection blocks. Each expert is independently initialised and learns to respond strongly to its own genre's acoustic patterns while ignoring others. Uses LayerNorm and GELU activations throughout.
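
A minimal sketch of that block pattern, assuming the dimensions from the Model Summary table below (hidden 64, output 32); the notebook's actual class may differ in detail:

import torch
import torch.nn as nn

class GenreExpert(nn.Module):
    """Residual MLP expert: two x + block(x) skips with LayerNorm and GELU (sketch)."""
    def __init__(self, in_dim=128, hidden_dim=64, out_dim=32, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.LayerNorm(hidden_dim),
                nn.Linear(hidden_dim, hidden_dim),
                nn.GELU(),
                nn.Dropout(dropout),            # dropout rate is an assumption
            )
            for _ in range(2)
        )
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.proj(x)
        for block in self.blocks:
            h = h + block(h)                    # skip connection
        return self.out(h)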

GatingNetwork

A 3-layer MLP that takes the shared 128-dim representation and outputs softmax weights over the five experts. During training, Gaussian noise is injected into the gate logits to encourage exploration of all experts and prevent routing collapse.
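
The layer stack below follows the Model Summary table; the noise standard deviation is an assumption:

class GatingNetwork(nn.Module):
    """Gate: 128-dim shared representation -> softmax weights over 5 experts (sketch)."""
    def __init__(self, in_dim=128, n_experts=5, noise_std=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.GELU(), nn.LayerNorm(64),
            nn.Linear(64, 32), nn.GELU(),
            nn.Linear(32, n_experts),
        )
        self.noise_std = noise_std

    def forward(self, x):
        logits = self.net(x)
        if self.training:
            # inject Gaussian noise into the gate logits to discourage routing collapse
            logits = logits + torch.randn_like(logits) * self.noise_std
        return torch.softmax(logits, dim=-1)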

MusicGenreMoE

The full model pipeline (see the sketch after this list):

  1. Shared encoder — two-layer MLP projects raw features to a 128-dim latent space shared by all experts
  2. Expert forward passes — all five experts process the shared representation in parallel
  3. Gating — soft weights computed from the shared representation
  4. Weighted fusion — Σ wᵢ · expertᵢ(x) merges expert outputs proportionally
  5. Classifier head — two-layer MLP maps the fused representation to genre logits
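
Putting the five steps together, assuming the GenreExpert and GatingNetwork sketches above (illustrative, not the notebook's exact code):

class MusicGenreMoE(nn.Module):
    """Encoder -> parallel experts -> gate -> weighted fusion -> classifier (sketch)."""
    def __init__(self, in_dim=59, latent_dim=128, n_experts=5, n_classes=5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(), nn.Dropout(0.3),
            nn.Linear(latent_dim, latent_dim), nn.LayerNorm(latent_dim), nn.GELU(),
        )
        self.experts = nn.ModuleList(GenreExpert(latent_dim) for _ in range(n_experts))
        self.gate = GatingNetwork(latent_dim, n_experts)
        self.head = nn.Sequential(
            nn.Linear(32, 64), nn.GELU(), nn.Dropout(0.15), nn.Linear(64, n_classes),
        )

    def forward(self, x):
        h = self.encoder(x)                                            # (B, 128)
        expert_out = torch.stack([e(h) for e in self.experts], dim=1)  # (B, 5, 32)
        w = self.gate(h)                                               # (B, 5)
        fused = (w.unsqueeze(-1) * expert_out).sum(dim=1)              # Σ wᵢ · expertᵢ(x)
        return self.head(fused), w                                     # logits + gate weights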

MoELoss

A composite training objective (a code sketch follows the table):

Term                                 Purpose
Cross-entropy (label smoothing 0.1)  Primary classification signal
Load-balance loss (λ=0.05)           Prevents one expert from handling everything; penalises routing imbalance (Switch Transformer formulation)
Diversity / entropy loss (λ=0.01)    Maximises entropy of gate weights; encourages the gating network to use all experts
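
A sketch of how the three terms might combine. The load-balance term follows the Switch Transformer recipe (N · Σᵢ fᵢ · Pᵢ over routed fractions fᵢ and mean gate probabilities Pᵢ); the exact entropy formulation in the notebook may differ:

import torch.nn.functional as F

class MoELoss(nn.Module):
    def __init__(self, n_experts=5, lambda_lb=0.05, lambda_div=0.01):
        super().__init__()
        self.n_experts, self.lambda_lb, self.lambda_div = n_experts, lambda_lb, lambda_div

    def forward(self, logits, targets, gate_weights):
        ce = F.cross_entropy(logits, targets, label_smoothing=0.1)
        # load balance: f_i = fraction of samples whose top expert is i, P_i = mean gate prob
        f = F.one_hot(gate_weights.argmax(dim=-1), self.n_experts).float().mean(dim=0)
        p = gate_weights.mean(dim=0)
        load_balance = self.n_experts * (f * p).sum()
        # diversity: maximise gate entropy, i.e. subtract it from the total loss
        entropy = -(gate_weights * gate_weights.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return ce + self.lambda_lb * load_balance - self.lambda_div * entropy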

Feature Extraction

Every audio clip is mapped to a fixed 59-dimensional feature vector by librosa (a sketch of the extractor follows the table):

Feature Group        Dims   Captures
MFCC mean            20     Timbre, spectral envelope
MFCC std             20     Timbre variation / texture
Chroma mean          12     Harmonic / tonal content
Spectral centroid     1     Brightness
Spectral bandwidth    1     Frequency spread
Spectral rolloff      1     High-frequency energy boundary
Zero crossing rate    1     Noisiness / percussiveness
RMS energy            1     Loudness / dynamics
Tempo                 1     BPM estimate
Harmonic ratio        1     Tonal vs. noise content
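
A plausible shape for extract_features() using standard librosa calls; the harmonic-ratio definition (HPSS energy fraction) is an assumption:

import numpy as np
import librosa

def extract_features(y, sr=22050):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    harmonic, _ = librosa.effects.hpss(y)
    harm_ratio = np.sum(harmonic ** 2) / (np.sum(y ** 2) + 1e-9)
    return np.concatenate([
        mfcc.mean(axis=1),                                         # 20: timbre
        mfcc.std(axis=1),                                          # 20: texture
        chroma.mean(axis=1),                                       # 12: tonal content
        [librosa.feature.spectral_centroid(y=y, sr=sr).mean()],    #  1: brightness
        [librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()],   #  1: spread
        [librosa.feature.spectral_rolloff(y=y, sr=sr).mean()],     #  1: HF boundary
        [librosa.feature.zero_crossing_rate(y).mean()],            #  1: noisiness
        [librosa.feature.rms(y=y).mean()],                         #  1: loudness
        [float(tempo)],                                            #  1: BPM
        [harm_ratio],                                              #  1: tonal vs. noise
    ]).astype(np.float32)                                          # 59 dims total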

Synthetic Audio Generators

No external dataset is required to get started. Each genre generator procedurally synthesises audio with that genre's characteristic acoustic signature (a toy example follows the table):

Genre       Synthesis characteristics
Jazz        Swing timing (long-short 8th notes), chromatic passing tones, 7th and 9th harmonics
Blues       12-bar-style phrasing, pentatonic scale, pitch bends, shuffle hi-hat pattern
Rock        Distorted odd harmonics (tanh clipping), power chord intervals, kick/snare on alternating beats
Pop         Clean major-scale melody, verse-chorus structure, four-on-the-floor kick, compressed dynamics
Classical   Rich harmonic series (many partials), string vibrato, wide dynamic crescendo, no percussion
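
As a flavour of the approach, here is a toy generator in the spirit of the Classical row above (illustrative only; the notebook's actual generator is more elaborate):

def make_classical_sketch(duration=3.0, sr=22050, f0=220.0):
    """Toy 'classical' tone: rich harmonic series, vibrato, crescendo, no percussion."""
    t = np.linspace(0, duration, int(sr * duration), endpoint=False)
    vibrato = 1.0 + 0.005 * np.sin(2 * np.pi * 5.5 * t)       # ~5.5 Hz string-like vibrato
    y = sum((0.5 ** k) * np.sin(2 * np.pi * k * f0 * vibrato * t) for k in range(1, 9))
    y *= np.linspace(0.1, 1.0, t.size)                        # wide dynamic crescendo
    return (y / np.abs(y).max()).astype(np.float32)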

Notebook Contents

The notebook music_genre_moe.ipynb runs end-to-end in 17 sections:

  1. Imports & Setup — packages, seeds, constants, device detection
  2. Synthetic Audio Generators — procedural genre synthesis
  3. Waveform & Spectrogram Visualisation — see each genre's acoustic signature
  4. Feature Extraction — extract_features() function with all 59 dimensions
  5. Dataset Generation — 900 samples (180 per genre × 5 genres), z-score normalisation
  6. PyTorch Dataset & DataLoaders — stratified train / val / test split (120 / 30 / 30 per genre)
  7. Model Architecture — GenreExpert, GatingNetwork, MusicGenreMoE
  8. Loss Functions — MoELoss with load-balance and diversity terms
  9. Training Loop — AdamW, cosine annealing LR, gradient clipping, early stopping
  10. Training Curves — loss and accuracy plots saved to training_curves.png
  11. Confusion Matrix & Classification Report — per-genre precision, recall, F1
  12. Expert Gate Weight Analysis — heatmap of which experts activate for which genres
  13. ROC Curves — one-vs-rest AUC per genre
  14. Expert Activation Radar Charts — polar plots of routing profile per genre
  15. Feature Importance — input gradient analysis, top-20 features and group importance
  16. Predict Your Own Songs — predict_audio_file() for .mp3 / .wav files
  17. Save & Load — checkpoint with model weights + normalisation stats

Quickstart

1. Install dependencies

pip install torch torchaudio librosa numpy scikit-learn matplotlib seaborn soundfile nbformat

2. Open the notebook

jupyter notebook music_genre_moe.ipynb

Then Run All Cells. The full pipeline — generation, training, evaluation — completes in a few minutes on CPU and under a minute on GPU.

3. Predict a song of your own

probs, weights, pred = predict_audio_file(
    'my_song.mp3',
    model,
    feat_mean=FEAT_MEAN,
    feat_std=FEAT_STD,
)
pretty_predict('My Song', probs, weights, pred)

Output:

═══════════════════════════════════════════════════════
  🎵 My Song
───────────────────────────────────────────────────────
  Predicted genre : JAZZ  (74% confidence)
───────────────────────────────────────────────────────
  Genre probabilities:
    jazz         [██████████████████████░░░░░░░░] 74%
    blues        [████░░░░░░░░░░░░░░░░░░░░░░░░░░] 14%
    classical    [██░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  8%
    rock         [█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  3%
    pop          [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  1%
  Expert gate weights:
    jazz         [▪▪▪▪▪▪▪▪▪▪▪▪▪▪······] 0.71
    blues        [▪▪▪▪················] 0.19
    classical    [▪···················] 0.06
    rock         [▪···················] 0.03
    pop          [····················] 0.01
═══════════════════════════════════════════════════════

4. Load a saved checkpoint

loaded_model, loaded_mean, loaded_std, genres = load_moe_model('moe_genre_checkpoint.pt')

Outputs

Running the notebook produces the following files:

File                     Contents
moe_genre_model.pt       Best model state dict (saved during training)
moe_genre_checkpoint.pt  Full checkpoint: weights + normalisation stats + genre list
genre_spectrograms.png   Waveforms and Mel spectrograms for all 5 genres
training_curves.png      Loss and accuracy over training epochs
confusion_matrix.png     Test set confusion matrix
expert_gates.png         Gate weight heatmap + expert self-activation bar chart
roc_curves.png           Per-genre ROC curves with AUC scores
radar_charts.png         Polar plots of expert routing profiles per genre
feature_importance.png   Top-20 features and group-level importance by gradient analysis

Publishing to Hugging Face Hub

This repository includes three files that make the model fully compatible with the Hugging Face ecosystem.

File              Purpose
modeling.py       Full model class definitions with PreTrainedModel and PretrainedConfig base classes
config.json       Model configuration with auto_map pointing to the custom classes
upload_to_hub.py  End-to-end export and push script

Step 1 — Install Hub dependencies

pip install huggingface_hub transformers safetensors

Step 2 — Log in

huggingface-cli login

Or set the HF_TOKEN environment variable with a token from huggingface.co/settings/tokens.

Step 3 — Run the upload script

python upload_to_hub.py --repo your-username/music-genre-moe

The script:

  1. Loads moe_genre_checkpoint.pt (saved by the notebook)
  2. Bakes the real feat_mean / feat_std normalisation stats into config.json
  3. Saves weights as model.safetensors (or pytorch_model.bin if safetensors is unavailable)
  4. Copies modeling.py, the notebook as walkthrough.ipynb, and the README
  5. Creates the Hub repo and uploads everything in one commit

Optional flags:

python upload_to_hub.py \
    --repo    your-username/music-genre-moe \
    --checkpoint moe_genre_checkpoint.pt \
    --private \
    --token   hf_xxxx...
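
For reference, the core of the export-and-push flow reduces to a handful of huggingface_hub calls. This is a sketch, not the actual script; the checkpoint key name is an assumption:

import os
import torch
from huggingface_hub import HfApi
from safetensors.torch import save_file

ckpt = torch.load("moe_genre_checkpoint.pt", map_location="cpu")
os.makedirs("export", exist_ok=True)
save_file(ckpt["model_state_dict"], "export/model.safetensors")   # key name assumed

api = HfApi()
api.create_repo("your-username/music-genre-moe", exist_ok=True)
api.upload_folder(folder_path="export", repo_id="your-username/music-genre-moe")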

Loading from the Hub

Once uploaded, anyone can load and use the model in two lines:

from transformers import AutoModel

model = AutoModel.from_pretrained(
    "your-username/music-genre-moe",
    trust_remote_code=True,
)
model.eval()

Running inference on a feature vector:

import torch

# features: (batch, 59) tensor — already z-score normalised
features = torch.randn(1, 59)

result = model.predict(features)
# {
#   "predicted_genre": "jazz",
#   "confidence": 0.7412,
#   "probabilities": {"jazz": 0.7412, "blues": 0.1803, ...},
#   "gate_weights":  {"jazz": 0.7105, "blues": 0.1920, ...},
# }
print(result)

If you have raw (unnormalised) features, pass already_normalised=False and the model will apply the stored stats automatically:

result = model.predict(raw_features, already_normalised=False)

What gets published

your-username/music-genre-moe/
├── config.json            ← architecture + normalisation stats + auto_map
├── model.safetensors      ← trained weights
├── modeling.py            ← custom model classes (loaded via trust_remote_code)
├── walkthrough.ipynb      ← full training notebook (rendered on Hub, Open in Colab button)
└── README.md              ← this file

The notebook renders directly on the Hub model page, and visitors get a one-click Open in Colab button to run the full training pipeline without downloading anything locally.


Extending the Model

Add more genres

Define a new generator function following the pattern of make_jazz, make_blues, etc., and add it to GENRE_GENERATORS. The MusicGenreMoE class is parameterised by n_experts and n_classes — just increase both and retrain.

Use a real dataset

The GTZAN Genre Collection provides 1,000 30-second clips across 10 genres. Load each with librosa.load() and pass through extract_features(). The rest of the pipeline is unchanged.
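
A sketch of the swap, assuming GTZAN's usual genre-per-folder layout:

import glob, os
import librosa

X, labels = [], []
for path in glob.glob("gtzan/genres/*/*.wav"):                 # folder layout assumed
    y, sr = librosa.load(path, duration=30)
    X.append(extract_features(y, sr))
    labels.append(os.path.basename(os.path.dirname(path)))     # genre = folder name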

Sparse Top-K gating

For faster inference, activate only the top-2 experts per input instead of all five. Replace the weighted sum with a sparse Top-K routing step, similar to the Switch Transformer or GShard papers.
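
A sketch of the routing change. Here all experts still run and the top-k outputs are selected afterwards; a truly sparse implementation would skip the non-selected experts entirely:

import torch

def topk_fuse(gate_logits, expert_out, k=2):
    """gate_logits: (B, E); expert_out: (B, E, D). Softmax over the top-k logits only."""
    top_logits, top_idx = gate_logits.topk(k, dim=-1)              # (B, k)
    w = torch.softmax(top_logits, dim=-1)                          # renormalised weights
    idx = top_idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1))
    picked = torch.gather(expert_out, 1, idx)                      # (B, k, D)
    return (w.unsqueeze(-1) * picked).sum(dim=1)                   # (B, D)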

Contrastive pre-training

Before end-to-end training, pre-train each expert with a triplet loss using prototype songs as anchors. This gives each expert a strong genre-specific initialisation before the gating network is introduced.
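
A sketch of that warm-up step using PyTorch's built-in triplet loss; prototype selection and batching details are assumptions, and the random tensors stand in for encoded clips:

import torch
import torch.nn as nn

expert = GenreExpert()                        # sketch class from above
triplet = nn.TripletMarginLoss(margin=1.0)

h_anchor = torch.randn(8, 128)                # encoded prototype clips of the expert's genre
h_positive = torch.randn(8, 128)              # other clips of the same genre
h_negative = torch.randn(8, 128)              # clips from other genres
loss = triplet(expert(h_anchor), expert(h_positive), expert(h_negative))
loss.backward()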

Temporal modelling

Replace the shared MLP encoder with a 1D-CNN or a small LSTM operating over time-windowed feature frames. This captures rhythm evolution and structural patterns (verse-chorus transitions) that the current frame-averaged features miss.
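
One way to sketch the drop-in replacement, taking frame-level inputs of shape (batch, frames, 59) instead of a single averaged vector:

class TemporalEncoder(nn.Module):
    """1D-CNN over feature frames, average-pooled to the 128-dim shared latent (sketch)."""
    def __init__(self, feat_dim=59, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(128, latent_dim, kernel_size=5, padding=2), nn.GELU(),
        )

    def forward(self, x):                      # x: (B, T, feat_dim)
        h = self.conv(x.transpose(1, 2))       # (B, latent_dim, T)
        return h.mean(dim=-1)                  # (B, latent_dim)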

Confidence calibration

Add temperature scaling on the classifier head logits to produce better-calibrated probability estimates, useful when using the output probabilities for downstream decisions.
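
A sketch of post-hoc temperature scaling, fitting a single scalar T on validation logits (random stand-ins here):

import torch
import torch.nn as nn
import torch.nn.functional as F

val_logits = torch.randn(200, 5)               # stand-in for collected validation logits
val_targets = torch.randint(0, 5, (200,))

T = nn.Parameter(torch.ones(1))
opt = torch.optim.LBFGS([T], lr=0.1, max_iter=50)

def closure():
    opt.zero_grad()
    loss = F.cross_entropy(val_logits / T.clamp_min(1e-3), val_targets)
    loss.backward()
    return loss

opt.step(closure)
calibrated = torch.softmax(val_logits / T.detach(), dim=-1)    # better-calibrated probs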


Model Summary

Component           Configuration
Input dimension     59
Shared encoder      Linear(59→128) → LayerNorm → GELU → Dropout(0.3) → Linear(128→128) → LayerNorm → GELU
Expert hidden dim   64
Expert output dim   32
Number of experts   5 (one per genre)
Gating network      Linear(128→64) → GELU → LayerNorm → Linear(64→32) → GELU → Linear(32→5) → Softmax
Classifier head     Linear(32→64) → GELU → Dropout(0.15) → Linear(64→5)
Total parameters    ~175,000
Optimiser           AdamW (lr=3e-3, weight_decay=1e-4)
LR schedule         Cosine annealing (T_max=120, η_min=1e-5)
Training epochs     Up to 120 with early stopping (patience=20)
Batch size          64

Evaluation Summary — Synthetic Data

⚠️ All results below are on synthetically generated audio only. Real-world audio evaluation is pending.

The model was trained and evaluated entirely on procedurally generated signals — one deterministic generator per genre (Jazz, Blues, Rock, Pop, Classical). Under these conditions the model achieves perfect scores across every metric, but this should be interpreted carefully.

Results on synthetic data

Metric                 Value
Test Accuracy          100%
Macro F1               1.00
ROC AUC (all genres)   1.000

These numbers reflect the model learning to distinguish five mathematical signal generators, not five real-world musical genres. The synthetic generators are deterministic enough that the classifier only needs to recognise the acoustic fingerprint of each function — a considerably simpler task than generalising to real recordings.

Training behaviour

Convergence was rapid and stable. Both train and validation loss tracked in lockstep with no divergence, and accuracy reached ~100% by epoch 3. There is no evidence of overfitting — the train/val gap never opened — but the corollary is that the task offered no real opportunity for it to emerge. The evaluation is not a meaningful test of generalisation.

Expert specialisation

The most informative result is the gate weight analysis. Despite perfect classification accuracy, the gating network did not develop the intended per-genre expert routing. Self-activation weights (the fraction of weight each genre's own expert receives) were low across the board:

Genre       Self-activation
Jazz         8%
Blues       17%
Rock        10%
Pop          1%
Classical    0%

The model learned to classify correctly by routing most inputs through one or two dominant experts rather than distributing responsibility across all five. This is a known failure mode in MoE training called expert collapse, and is the primary issue to address before evaluating on real audio. Increasing the load-balance loss penalty (lambda_lb) and optionally pre-training each expert on its own genre in isolation are the recommended next steps.
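
For reference, the self-activation numbers above can be computed directly from the test-set gate weights (a sketch; the notebook's exact helper may differ):

import torch

def self_activation(gate_weights, labels, n_genres=5):
    """gate_weights: (N, E); labels: (N,) genre indices, with expert i paired to genre i."""
    routing = torch.stack([gate_weights[labels == g].mean(dim=0) for g in range(n_genres)])
    return routing.diagonal()                  # genre g's mean weight on its own expert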

What comes next

Real-audio evaluation on a dataset such as GTZAN is required before any of the above metrics can be considered meaningful. On real recordings the expectation is:

  • Accuracy drops to a more honest 70–90% range
  • A genuine train/val gap may emerge, making regularisation decisions consequential
  • Expert specialisation becomes harder to achieve and more valuable to diagnose
  • Cross-genre tracks (blues-rock, jazz-classical) will stress-test the soft routing mechanism in ways synthetic data cannot

The architecture and training pipeline are validated end-to-end. The synthetic results confirm the implementation is correct. Real-audio benchmarking is the critical next step.


Requirements

torch >= 2.0
torchaudio >= 2.0
librosa >= 0.10
numpy
scikit-learn
matplotlib
seaborn
soundfile
nbformat

Python 3.9 or later recommended.


License

MIT — free to use, modify, and extend.
