S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models
S-SONDO distills large audio foundation models into lightweight students that are up to 61x smaller while retaining up to 96% of teacher performance, using only output embeddings.
Paper: S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models (ICASSP 2026)
Authors: Mohammed Ali El Adlouni*, Aurian Quelennec*, Pierre Chouteau, Geoffroy Peeters, Slim Essid
Affiliation: LTCI, Télécom Paris, Institut Polytechnique de Paris
Quick Start
```bash
pip install ssondo
```
```python
import torchaudio
from ssondo import get_ssondo

# Load model (auto-downloads and caches)
model = get_ssondo("matpac-mobilenetv3")

# Load audio
x, sr = torchaudio.load("audio.wav")
x = x.mean(dim=0, keepdim=True)  # mono

# Extract embeddings
embeddings = model(x)  # (1, n_segments, 960)
```
Available Checkpoints
| Model | Teacher | Student | Params | Embedding Size | Status |
|---|---|---|---|---|---|
| matpac-mobilenetv3 | MATPAC++ | MobileNetV3 | 9.1M | 960 | ✅ Available |
| matpac-dymn | MATPAC++ | DyMN | – | 960 | 🔜 Coming soon |
| matpac-eres2net | MATPAC++ | ERes2Net | – | varies | 🔜 Coming soon |
| m2d-mobilenetv3 | M2D | MobileNetV3 | 9.1M | 960 | 🔜 Coming soon |
| m2d-dymn | M2D | DyMN | – | 960 | 🔜 Coming soon |
| m2d-eres2net | M2D | ERes2Net | – | varies | 🔜 Coming soon |
Model Details
Architecture
The matpac-mobilenetv3 model consists of:
- Backbone: MobileNetV3 (2.9M params, pretrained on ImageNet)
- Classification Head: MLP projecting 960-dim embeddings to 3840-dim teacher space
- Total Parameters: 9.1M
Training
- Teacher: MATPAC++ (`matpac_plus_6s_2048_enconly.pt`)
- Dataset: AudioSet (2M+ clips)
- Loss: Cosine similarity between student and teacher embeddings
- Sampling: Cluster-aware balanced sampling (50 clusters via MiniBatchKMeans)
- Preprocessing: 10s audio windows → 128-band log-mel spectrograms (32kHz, 32ms window, 16ms hop)
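The distillation objective above can be sketched numerically: the student head's 3840-dim projections are driven toward the frozen teacher's embeddings via cosine similarity. A minimal numpy illustration (function name and exact reduction are ours, not the paper's implementation):

```python
import numpy as np

def cosine_distill_loss(student_proj: np.ndarray, teacher_emb: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between student projections and teacher embeddings.

    student_proj: (batch, 3840) outputs of the student's MLP head
    teacher_emb:  (batch, 3840) frozen teacher embeddings
    """
    s = student_proj / np.linalg.norm(student_proj, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

The loss is 0 when student and teacher embeddings point in the same direction and 2 when they are antipodal, independent of their norms.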
Input
- Format: Raw mono audio waveform
- Sample Rate: 32,000 Hz
- Slicing: Audio is automatically sliced into 10-second non-overlapping segments
- Spectrogram: 128 mel bands, 50-16000 Hz, 32ms window, 16ms hop
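The slicing rule above means a mono waveform is cut into non-overlapping blocks of 10 s × 32,000 Hz = 320,000 samples. A numpy sketch of that step (the helper name is ours; the package does this internally, and how it treats a partial final segment is an assumption here):

```python
import numpy as np

SR = 32_000            # model sample rate
SEG = 10 * SR          # 320,000 samples per 10-second segment

def slice_segments(wave: np.ndarray) -> np.ndarray:
    """Split a mono waveform into non-overlapping 10 s segments.

    Tail handling is an assumption: samples beyond the last full
    segment are dropped here (the package may pad instead).
    """
    n = len(wave) // SEG
    return wave[: n * SEG].reshape(n, SEG)
```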
Output
- Embeddings: `(batch, n_segments, 960)` — general-purpose audio representations
- Logits (optional): `(batch, 3840)` — projection into teacher embedding space
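For clip-level tasks (tagging, classification), the per-segment embeddings are typically pooled into a single vector per clip; mean pooling over the segment axis is one common choice (our suggestion, not a step prescribed by the package):

```python
import numpy as np

# Embeddings as returned by the model: (batch, n_segments, 960)
embeddings = np.random.randn(2, 3, 960)

# Mean-pool over segments to get one 960-dim vector per clip
clip_emb = embeddings.mean(axis=1)  # (2, 960)
```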
Usage Examples
Extract Embeddings
```python
from ssondo import get_ssondo

model = get_ssondo("matpac-mobilenetv3")
embeddings = model(audio)  # (batch, n_segments, 960)
```
With Logits
```python
model = get_ssondo("matpac-mobilenetv3", return_logits=True)
embeddings, logits = model(audio)
```
GPU Inference
```python
model = get_ssondo("matpac-mobilenetv3", device="cuda")
embeddings = model(audio.cuda())
```
List Available Models
```python
from ssondo import list_models

for name, desc in list_models().items():
    print(f"{name}: {desc}")
```
Training Code
Full training pipeline available at: github.com/MedAliAdlouni/ssondo_temp
Citation
```bibtex
@inproceedings{eladlouni2026ssondo,
  title={S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models},
  author={El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}
```
License
MIT