S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

S-SONDO distills large audio foundation models into lightweight students that are up to 61x smaller while retaining up to 96% of teacher performance, using only output embeddings.

Paper: S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models (ICASSP 2026)

Authors: Mohammed Ali El Adlouni*, Aurian Quelennec*, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Affiliation: LTCI, TΓ©lΓ©com Paris, Institut Polytechnique de Paris

Quick Start

pip install ssondo

import torchaudio
from ssondo import get_ssondo

# Load model (auto-downloads and caches)
model = get_ssondo("matpac-mobilenetv3")

# Load audio
x, sr = torchaudio.load("audio.wav")
x = x.mean(dim=0, keepdim=True)  # mono

# Extract embeddings
embeddings = model(x)  # (1, n_segments, 960)

Available Checkpoints

Model Teacher Student Params Embedding Size Status
matpac-mobilenetv3 MATPAC++ MobileNetV3 9.1M 960 ✅ Available
matpac-dymn MATPAC++ DyMN — 960 🔜 Coming soon
matpac-eres2net MATPAC++ ERes2Net — varies 🔜 Coming soon
m2d-mobilenetv3 M2D MobileNetV3 9.1M 960 🔜 Coming soon
m2d-dymn M2D DyMN — 960 🔜 Coming soon
m2d-eres2net M2D ERes2Net — varies 🔜 Coming soon

Model Details

Architecture

The matpac-mobilenetv3 model consists of:

  • Backbone: MobileNetV3 (2.9M params, pretrained on ImageNet)
  • Classification Head: MLP projecting 960-dim embeddings to 3840-dim teacher space
  • Total Parameters: 9.1M

Training

  • Teacher: MATPAC++ (matpac_plus_6s_2048_enconly.pt)
  • Dataset: AudioSet (2M+ clips)
  • Loss: Cosine similarity between student and teacher embeddings
  • Sampling: Cluster-aware balanced sampling (50 clusters via MiniBatchKMeans)
  • Preprocessing: 10s audio windows → 128-band log-mel spectrograms (32kHz, 32ms window, 16ms hop)
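The loss above can be sketched as one minus the mean cosine similarity between student and teacher embeddings. This is a minimal reading of the bullet; the reduction over the batch is an assumption, not taken from the paper:

```python
import torch
import torch.nn.functional as F


def cosine_distillation_loss(student_emb: torch.Tensor,
                             teacher_emb: torch.Tensor) -> torch.Tensor:
    """1 - mean cosine similarity; 0 when embeddings align, 2 when opposed."""
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()
```

The student minimizes this against frozen teacher embeddings, so only the teacher's outputs (not its weights or intermediate activations) are needed for distillation.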

Input

  • Format: Raw mono audio waveform
  • Sample Rate: 32,000 Hz
  • Slicing: Audio is automatically sliced into 10-second non-overlapping segments
  • Spectrogram: 128 mel bands, 50-16000 Hz, 32ms window, 16ms hop

Output

  • Embeddings: (batch, n_segments, 960), general-purpose audio representations
  • Logits (optional): (batch, 3840), projection into the teacher embedding space
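For clip-level tasks, the per-segment embeddings can be pooled into a single vector. Mean pooling is an assumption here, not a prescription from the paper, and clip_embedding is a hypothetical helper name:

```python
import torch


def clip_embedding(seg_embeddings: torch.Tensor) -> torch.Tensor:
    """Average per-segment embeddings: (batch, n_segments, 960) -> (batch, 960)."""
    return seg_embeddings.mean(dim=1)
```

The pooled vector can then feed a linear probe or downstream classifier.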

Usage Examples

Extract Embeddings

from ssondo import get_ssondo

model = get_ssondo("matpac-mobilenetv3")
embeddings = model(audio)  # (batch, n_segments, 960)

With Logits

model = get_ssondo("matpac-mobilenetv3", return_logits=True)
embeddings, logits = model(audio)

GPU Inference

model = get_ssondo("matpac-mobilenetv3", device="cuda")
embeddings = model(audio.cuda())

List Available Models

from ssondo import list_models
for name, desc in list_models().items():
    print(f"{name}: {desc}")

Training Code

Full training pipeline available at: github.com/MedAliAdlouni/ssondo_temp

Citation

@inproceedings{eladlouni2026ssondo,
  title={S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models},
  author={El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2026}
}

License

MIT
