# AST Music vs Speech Classifier (45K)
Fine-tuned Audio Spectrogram Transformer (AST) for music vs speech classification.
## Model Details
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Task: Binary Audio Classification (Music vs Speech)
- Training Dataset: AIGenLab/speech-music-45k (45000 samples)
- Overall Accuracy: 86.7% (26 of 30 test files)
## Performance Results

| Category | Accuracy | Correct | Total |
|---|---|---|---|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 70.0% | 7 | 10 |
| Speech + Music | 90.0% | 9 | 10 |
### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 0.999 | 0.001 | MUSIC | ✅ |
| music_2.wav | 0.999 | 0.001 | MUSIC | ✅ |
| music_3.wav | 0.999 | 0.001 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 0.996 | 0.004 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 0.998 | 0.002 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |
### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| speech_1.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_10.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.824 | 0.176 | MUSIC | ❌ |
| speech_4.wav | 0.978 | 0.022 | MUSIC | ❌ |
| speech_5.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_6.wav | 0.038 | 0.962 | SPEECH | ✅ |
| speech_7.wav | 0.003 | 0.997 | SPEECH | ✅ |
| speech_8.wav | 0.001 | 0.999 | SPEECH | ✅ |
| speech_9.wav | 0.000 | 1.000 | SPEECH | ✅ |
### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| speech_and_music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 0.353 | 0.647 | SPEECH | ❌ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |
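As a quick sanity check, the per-category counts in the tables above reproduce the headline 86.7% figure:

```python
# Per-category (correct, total) counts taken from the result tables above.
results = {
    "Pure Music": (10, 10),
    "Pure Speech": (7, 10),
    "Speech + Music": (9, 10),
}

for category, (correct, total) in results.items():
    print(f"{category}: {correct}/{total} = {correct / total:.1%}")

overall_correct = sum(c for c, _ in results.values())
overall_total = sum(t for _, t in results.values())
print(f"Overall: {overall_correct}/{overall_total} = {overall_correct / overall_total:.1%}")
# → Overall: 26/30 = 86.7%
```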
## Quick Start

```python
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-45K"
)

result = classifier("your_audio.wav")
print(result)
```
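The pipeline returns a list of `{"label", "score"}` dicts. A minimal way to pick the winning class; the label strings shown here are illustrative, the actual names come from the model's `config.id2label` mapping:

```python
# Illustrative pipeline output shape; the actual label strings come from
# the model's config (model.config.id2label) and may differ.
result = [
    {"label": "music", "score": 0.93},
    {"label": "speech", "score": 0.07},
]

# Take the entry with the highest score as the predicted class.
top = max(result, key=lambda r: r["score"])
print(f"{top['label']} ({top['score']:.2f})")
# → music (0.93)
```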
## Advanced Usage

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-45K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-45K"
)

# Load the audio and resample to the 16 kHz rate AST expects
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Downmix multi-channel audio to mono so squeeze() yields a 1-D array
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)

inputs = feature_extractor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the two logits: index 0 = music, index 1 = speech
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()
print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```
## Training Details

| Parameter | Value |
|---|---|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-45k (45,000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |
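The 2.5x music loss weight from the table can be realized with per-class weights in the cross-entropy loss. A minimal sketch, assuming label index 0 = music and index 1 = speech (matching the score indexing used in the Advanced Usage example; the exact training code is not published here):

```python
import torch
import torch.nn as nn

# Class weights per the training table: music (index 0) weighted 2.5x,
# speech (index 1) weighted 1.0x. The index assignment is an assumption.
class_weights = torch.tensor([2.5, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # dummy batch of 2 predictions
labels = torch.tensor([0, 1])                     # ground truth: music, speech
loss = loss_fn(logits, labels)
print(f"weighted loss: {loss.item():.4f}")
```

With `weight=` set, `CrossEntropyLoss` scales each sample's loss by its class weight (and normalizes by the sum of weights), so misclassified music clips pull 2.5x harder on the gradient than misclassified speech clips.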