S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
S-SONDO distills large audio foundation models into lightweight students that are up to 61x smaller while retaining up to 96% of teacher performance, using only output embeddings.
Paper: S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models (ICASSP 2026)
Authors: Mohammed Ali El Adlouni*, Aurian Quelennec*, Pierre Chouteau, Geoffroy Peeters, Slim Essid
Affiliation: LTCI, Télécom Paris, Institut Polytechnique de Paris
Quick Start
```bash
pip install ssondo
```
```python
import torchaudio
from ssondo import get_ssondo

# Load model (auto-downloads and caches)
model = get_ssondo("matpac-mobilenetv3")

# Load audio
x, sr = torchaudio.load("audio.wav")
x = x.mean(dim=0, keepdim=True)  # mono

# Extract embeddings
embeddings = model(x)  # (1, n_segments, 960)
```
Available Checkpoints
| Model | Teacher | Student | Params | Embedding Size | Status |
|---|---|---|---|---|---|
| matpac-mobilenetv3 | MATPAC++ | MobileNetV3 | 9.1M | 960 | ✅ Available |
| matpac-dymn | MATPAC++ | DyMN | – | 960 | 🔜 Coming soon |
| matpac-eres2net | MATPAC++ | ERes2Net | – | varies | 🔜 Coming soon |
| m2d-mobilenetv3 | M2D | MobileNetV3 | 9.1M | 960 | 🔜 Coming soon |
| m2d-dymn | M2D | DyMN | – | 960 | 🔜 Coming soon |
| m2d-eres2net | M2D | ERes2Net | – | varies | 🔜 Coming soon |
Model Details
Architecture
The matpac-mobilenetv3 model consists of:
- Backbone: MobileNetV3 (2.9M params, pretrained on ImageNet)
- Classification Head: MLP projecting 960-dim embeddings to 3840-dim teacher space
- Total Parameters: 9.1M
Training
- Teacher: MATPAC++ (`matpac_plus_6s_2048_enconly.pt`)
- Dataset: AudioSet (2M+ clips)
- Loss: Cosine similarity between student and teacher embeddings
- Sampling: Cluster-aware balanced sampling (50 clusters via MiniBatchKMeans)
- Preprocessing: 10s audio windows → 128-band log-mel spectrograms (32kHz, 32ms window, 16ms hop)
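The distillation objective above can be sketched numerically: the student head's 3840-dim projections are driven toward the frozen teacher's embeddings via cosine similarity. A minimal numpy illustration (function name and exact reduction are ours, not the paper's implementation):

```python
import numpy as np

def cosine_distill_loss(student_proj: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between student projections and teacher embeddings.

    student_proj: (batch, 3840) outputs of the student's MLP head
    teacher_emb:  (batch, 3840) frozen teacher embeddings
    """
    s = student_proj / np.linalg.norm(student_proj, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

The loss is 0 when student and teacher embeddings point in the same direction and 2 when they are antipodal, independent of their norms.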
Input
- Format: Raw mono audio waveform
- Sample Rate: 32,000 Hz
- Slicing: Audio is automatically sliced into 10-second non-overlapping segments
- Spectrogram: 128 mel bands, 50-16000 Hz, 32ms window, 16ms hop
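The slicing rule above means a mono waveform is cut into non-overlapping blocks of 10 s × 32,000 Hz = 320,000 samples. A numpy sketch of that step (the helper name is ours; the package does this internally, and how it treats a partial final segment is an assumption here):

```python
import numpy as np

SR = 32_000            # model sample rate
SEG = 10 * SR          # 320,000 samples per 10-second segment

def slice_segments(wave: np.ndarray) -> np.ndarray:
    """Split a mono waveform into non-overlapping 10 s segments.

    Tail handling is an assumption: samples beyond the last full
    segment are dropped here (the package may pad instead).
    """
    n = len(wave) // SEG
    return wave[: n * SEG].reshape(n, SEG)
```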
Output
- Embeddings: `(batch, n_segments, 960)` — general-purpose audio representations
- Logits (optional): `(batch, 3840)` — projection into teacher embedding space
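For clip-level tasks (tagging, classification), the per-segment embeddings are typically pooled into a single vector per clip; mean pooling over the segment axis is one common choice (our suggestion, not a step prescribed by the package):

```python
import numpy as np

# Embeddings as returned by the model: (batch, n_segments, 960)
embeddings = np.random.randn(2, 3, 960)

# Mean-pool over segments to get one 960-dim vector per clip
clip_emb = embeddings.mean(axis=1)  # (2, 960)
```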
Usage Examples
Extract Embeddings
```python
from ssondo import get_ssondo

model = get_ssondo("matpac-mobilenetv3")
embeddings = model(audio)  # (batch, n_segments, 960)
```
With Logits
```python
model = get_ssondo("matpac-mobilenetv3", return_logits=True)
embeddings, logits = model(audio)
```
GPU Inference
```python
model = get_ssondo("matpac-mobilenetv3", device="cuda")
embeddings = model(audio.cuda())
```
List Available Models
```python
from ssondo import list_models

for name, desc in list_models().items():
    print(f"{name}: {desc}")
```
Training Code
Full training pipeline available at: github.com/MedAliAdlouni/ssondo_temp
Citation
```bibtex
@inproceedings{eladlouni2026ssondo,
  title={S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models},
  author={El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```
License
MIT