🎵 AST Music vs Speech Classifier (45K)

Fine-tuned Audio Spectrogram Transformer (AST) for music vs speech classification.

Model Details

  • Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
  • Task: Binary Audio Classification (Music vs Speech)
  • Training Dataset: AIGenLab/speech-music-45k (45000 samples)
  • Overall Accuracy: 86.7% (26/30 test files)

📊 Performance Results

| Category       | Accuracy | Correct | Total |
|----------------|----------|---------|-------|
| Pure Music     | 100.0%   | 10      | 10    |
| Pure Speech    | 70.0%    | 7       | 10    |
| Speech + Music | 90.0%    | 9       | 10    |
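The per-category counts above reproduce the overall figure; a quick arithmetic check:

```python
# Correct predictions per category, from the table above (10 test files each)
correct = {"Pure Music": 10, "Pure Speech": 7, "Speech + Music": 9}
total = 30

accuracy = sum(correct.values()) / total
print(f"{accuracy:.1%}")  # 86.7%
```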

Pure Music

| File         | Music Score | Speech Score | Prediction | Result |
|--------------|-------------|--------------|------------|--------|
| music_1.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| music_10.wav | 0.999       | 0.001        | MUSIC      | ✅     |
| music_2.wav  | 0.999       | 0.001        | MUSIC      | ✅     |
| music_3.wav  | 0.999       | 0.001        | MUSIC      | ✅     |
| music_4.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| music_5.wav  | 0.996       | 0.004        | MUSIC      | ✅     |
| music_6.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| music_7.wav  | 0.998       | 0.002        | MUSIC      | ✅     |
| music_8.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| music_9.wav  | 1.000       | 0.000        | MUSIC      | ✅     |

Pure Speech

| File          | Music Score | Speech Score | Prediction | Result |
|---------------|-------------|--------------|------------|--------|
| speech_1.wav  | 0.000       | 1.000        | SPEECH     | ✅     |
| speech_10.wav | 0.000       | 1.000        | SPEECH     | ✅     |
| speech_2.wav  | 0.000       | 1.000        | SPEECH     | ✅     |
| speech_3.wav  | 0.824       | 0.176        | MUSIC      | ❌     |
| speech_4.wav  | 0.978       | 0.022        | MUSIC      | ❌     |
| speech_5.wav  | 1.000       | 0.000        | MUSIC      | ❌     |
| speech_6.wav  | 0.038       | 0.962        | SPEECH     | ✅     |
| speech_7.wav  | 0.003       | 0.997        | SPEECH     | ✅     |
| speech_8.wav  | 0.001       | 0.999        | SPEECH     | ✅     |
| speech_9.wav  | 0.000       | 1.000        | SPEECH     | ✅     |

Speech + Music

| File                    | Music Score | Speech Score | Prediction | Result |
|-------------------------|-------------|--------------|------------|--------|
| speech_and_music_1.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_10.wav | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_2.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_3wav.wav | 1.000     | 0.000        | MUSIC      | ✅     |
| speech_and_music_4.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_5.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_6.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_7.wav  | 0.353       | 0.647        | SPEECH     | ❌     |
| speech_and_music_8.wav  | 1.000       | 0.000        | MUSIC      | ✅     |
| speech_and_music_9.wav  | 1.000       | 0.000        | MUSIC      | ✅     |

🚀 Quick Start

```python
from transformers import pipeline

# Load the fine-tuned classifier
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-45K"
)

# Classify an audio file
result = classifier("your_audio.wav")
print(result)
```
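The pipeline returns a list of label/score dicts sorted by score. A minimal sketch of picking the top label; the exact label strings ("music"/"speech") are an assumption about this checkpoint's config, not confirmed from the model card:

```python
# Hypothetical pipeline output for a music clip; the label strings
# ("music"/"speech") are assumed, not taken from the model config
result = [
    {"label": "music", "score": 0.998},
    {"label": "speech", "score": 0.002},
]

top = max(result, key=lambda r: r["score"])
print(f"{top['label']} ({top['score']:.3f})")  # music (0.998)
```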

🔧 Advanced Usage

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-45K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-45K"
)

# Load audio (AST expects 16 kHz mono)
audio, sr = torchaudio.load("audio.wav")
if audio.shape[0] > 1:  # mix down stereo to mono
    audio = audio.mean(dim=0, keepdim=True)
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Extract spectrogram features
inputs = feature_extractor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```
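AST checkpoints are trained on roughly 10-second clips, so long recordings are usually scored in overlapping windows whose scores are then averaged or max-pooled. A minimal sliding-window sketch; the window and hop lengths are illustrative choices, not values from this model card:

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                window_s: float = 10.0, hop_s: float = 5.0) -> list[np.ndarray]:
    """Split a mono waveform into overlapping fixed-length windows."""
    win, hop = int(window_s * sr), int(hop_s * sr)
    if len(audio) <= win:
        return [audio]
    starts = range(0, len(audio) - win + 1, hop)
    return [audio[s:s + win] for s in starts]

# 25 s of audio -> windows starting at 0 s, 5 s, 10 s, 15 s
chunks = chunk_audio(np.zeros(16000 * 25))
print(len(chunks))  # 4
```

Each chunk can be passed through the feature extractor and model exactly as above, then the per-window music/speech scores combined into a clip-level decision.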

📊 Training Details

| Parameter     | Value                                   |
|---------------|-----------------------------------------|
| Base Model    | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset       | AIGenLab/speech-music-45k (45,000 samples) |
| Epochs        | 1                                       |
| Batch Size    | 64                                      |
| Learning Rate | 3e-5                                    |
| Loss Weight   | Music: 2.5x, Speech: 1.0x               |
| Optimizer     | AdamW                                   |
| Framework     | Transformers + PyTorch                  |
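The 2.5x/1.0x loss weights correspond to class-weighted cross-entropy; a minimal sketch of the idea. The label order (music = 0, speech = 1) is an assumption consistent with the score columns in the results tables, not confirmed from the training code:

```python
import torch
import torch.nn as nn

# Class-weighted cross-entropy matching the table: errors on music clips
# cost 2.5x errors on speech clips. Label order (music=0, speech=1) is
# an assumption.
weights = torch.tensor([2.5, 1.0])
loss_fn = nn.CrossEntropyLoss(weight=weights, reduction="none")

logits = torch.tensor([[0.0, 1.0],    # misclassified music clip
                       [1.0, 0.0]])   # symmetric misclassified speech clip
labels = torch.tensor([0, 1])

per_sample = loss_fn(logits, labels)
print(f"ratio: {(per_sample[0] / per_sample[1]).item():.1f}")  # ratio: 2.5
```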