SAM-Audio ONNX (Small)

ONNX-converted models for SAM-Audio (facebook/sam-audio-small) - Meta's Semantic Audio Modeling for audio source separation.

Model Files

| File | Description | Size |
|---|---|---|
| dacvae_encoder.onnx | Audio encoder (48kHz → latent) | ~110 MB |
| dacvae_decoder.onnx | Audio decoder (latent → 48kHz) | ~320 MB |
| t5_encoder.onnx | Text encoder (T5-base) | ~440 MB |
| dit_single_step.onnx | DiT denoiser (single ODE step) | ~2 GB |
| vision_encoder.onnx | Vision encoder (CLIP-based) | ~1.2 GB |
| peaframe.onnx | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| tokenizer/ | SentencePiece tokenizer files (T5) | - |
| peaframe_tokenizer/ | ModernBERT tokenizer files (PEAFrame) | - |
| peaframe_config.json | PEAFrame scaling parameters | - |
| clap_audio_encoder.onnx | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| clap_text_encoder.onnx | CLAP text encoder (RoBERTa-base) | ~481 MB |
| clap_tokenizer/ | RoBERTa tokenizer files (CLAP) | - |
| clap_config.json | CLAP audio preprocessing parameters | - |

Installation

pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
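
After installing onnxruntime-gpu, you can verify that ONNX Runtime actually sees the GPU before running inference:

import onnxruntime as ort

# onnxruntime-gpu exposes CUDAExecutionProvider; the CPU-only package lists only CPUExecutionProvider.
print(ort.get_available_providers())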

Usage Examples

Audio-Only Separation

python onnx_inference.py \
    --audio input.wav \
    --text "a person speaking" \
    --output separated.wav
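
onnx_inference.py loads and resamples the audio for you; if you prepare inputs yourself, the main requirement (see Model Specifications below) is 48kHz audio. A minimal sketch using torchaudio, where the mono downmix is an assumption:

import torchaudio

# Load a file and resample to the 48 kHz rate the DACVAE encoder expects.
waveform, sr = torchaudio.load("input.wav")          # (channels, samples)
if sr != 48000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=48000)
audio = waveform.mean(dim=0, keepdim=True)           # downmix to mono (assumption)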

Video-Guided Separation

python onnx_inference.py \
    --video input.mp4 \
    --text "the sound of typing" \
    --output separated.wav
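
For video input, frames are resized to the vision encoder's 336×336 resolution. A rough preprocessing sketch using torchvision (the frame stride and normalization are assumptions, and the bundled script may use torchcodec instead):

import torch
from torchvision.io import read_video
from torchvision.transforms.functional import resize

# Read frames, subsample, and resize to the 336x336 input expected by the vision encoder.
frames, _, _ = read_video("input.mp4", pts_unit="sec")   # (T, H, W, C), uint8
frames = frames[::8]                                      # subsample frames (assumed stride)
frames = frames.permute(0, 3, 1, 2).float() / 255.0       # (T, C, H, W) in [0, 1]
frames = resize(frames, [336, 336], antialias=True)       # match the 336x336 input size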

Automatic Span Prediction

Use PEAFrame to automatically detect time spans matching your text description:

python onnx_inference.py \
    --audio input.wav \
    --text "horn" \
    --predict-spans \
    --output separated.wav

This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
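
Conceptually, PEAFrame scores short overlapping chunks of the input against the text and keeps the chunks that clear a similarity threshold. A rough sketch of that loop, with score_chunk standing in for the actual peaframe.onnx call (whose real inputs and outputs differ and are handled by the script):

import numpy as np

# Score ~3.3 s chunks with 50% overlap and keep those above the default 0.3 threshold.
SR, CHUNK, THRESH = 48000, int(3.3 * 48000), 0.3
HOP = CHUNK // 2

def predict_spans(audio: np.ndarray, score_chunk) -> list[tuple[float, float]]:
    spans = []
    for start in range(0, max(len(audio) - CHUNK, 1), HOP):
        chunk = audio[start:start + CHUNK]
        if score_chunk(chunk) >= THRESH:            # audio-text similarity for this chunk
            spans.append((start / SR, (start + CHUNK) / SR))
    return spans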

Manual Anchors

Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):

# Focus on specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --anchor + 4.5 7.0 \
    --anchor + 12.0 15.5 \
    --output separated.wav

# Ignore specific time ranges
python onnx_inference.py \
    --audio input.wav \
    --text "background music" \
    --anchor - 0.0 3.0 \
    --output separated.wav
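
Anchors are specified in seconds. Internally they must be mapped onto latent frames; with the 48kHz sample rate and 1536-sample hop length listed under Model Specifications, one latent frame spans 32 ms. A sketch of that conversion (the rounding behaviour is an assumption):

SAMPLE_RATE = 48000
HOP_LENGTH = 1536                      # samples per latent frame -> 1536 / 48000 = 32 ms

def anchor_to_frames(start_s: float, end_s: float) -> tuple[int, int]:
    """Map an anchor given in seconds onto latent-frame indices (illustrative only)."""
    frames_per_second = SAMPLE_RATE / HOP_LENGTH   # 31.25 frames/s
    return round(start_s * frames_per_second), round(end_s * frames_per_second)

# Example: the positive anchor "+ 4.5 7.0" covers roughly latent frames 141-219.
print(anchor_to_frames(4.5, 7.0))      # (141, 219)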

CLAP Reranking

Generate multiple candidates and select the best using CLAP audio-text similarity:

python onnx_inference.py \
    --audio input.wav \
    --text "person speaking" \
    --rerank \
    --num-candidates 4 \
    --output separated.wav

Reranking generates multiple separation candidates with different random seeds, scores each one against the text description using CLAP audio-text similarity, and keeps the best-matching candidate. This can improve quality at the cost of ~4x inference time; a sketch of the scoring step follows the options list below.

Options:

  • --rerank - Enable reranking mode
  • --num-candidates N - Number of candidates (default: 4)
  • --rerank-seed SEED - Random seed for reproducibility
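
Under the hood the selection step is a cosine-similarity comparison between 512-dim CLAP embeddings. A minimal sketch, where embed_audio and text_embed are hypothetical stand-ins for the exported CLAP encoders:

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank(candidates, text_embed, embed_audio):
    """Pick the separation whose CLAP audio embedding best matches the text embedding.

    `candidates` are separated waveforms from different seeds; `embed_audio` and
    `text_embed` stand in for the CLAP ONNX encoders (512-dim embeddings).
    """
    scores = [cosine(embed_audio(c), text_embed) for c in candidates]
    return candidates[int(np.argmax(scores))], scores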

Visual Prompting with SAM3 Mask

# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
    --video input.mp4 \
    --mask object_mask.mp4 \
    --text "" \
    --output isolated.wav \
    --output-video visualization.mp4

Using a Custom Model Directory

python onnx_inference.py \
    --video input.mp4 \
    --text "woman speaking" \
    --model-dir ./my_onnx_models \
    --output separated.wav

Model Specifications

  • Audio Sample Rate: 48kHz
  • Audio Hop Length: 1536 samples
  • Vision Input Size: 336×336 pixels
  • Text Encoder: T5-base (768-dim)
  • Vision Encoder: PE-Core-L14-336 (1024-dim)
  • ODE Solver: Midpoint method (configurable steps, default 16; see the sketch after this list)
  • PEAFrame: Audio-text similarity model for span detection
    • Uses ModernBERT tokenizer
    • Processes audio in ~3.3s chunks with 50% overlap
    • Default threshold: 0.3
  • CLAP: Audio-text similarity model for candidate reranking
    • Audio encoder: HTSAT-tiny
    • Text encoder: RoBERTa-base
    • Embedding dimension: 512
    • Default candidates: 4
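
The midpoint solver referenced above can be written as a short loop around dit_single_step.onnx. The sketch below shows only the numerical scheme; the feed names ("latent", "t") and the conditioning dict are illustrative assumptions, not the model's actual input signature:

import numpy as np
import onnxruntime as ort

def midpoint_solve(dit: ort.InferenceSession, x: np.ndarray, cond: dict, steps: int = 16):
    """Integrate the flow from t=0 to t=1 with the midpoint method (default 16 steps)."""
    def velocity(x_t, t):
        # One call to dit_single_step.onnx; input names here are assumptions.
        feeds = {"latent": x_t, "t": np.array([t], dtype=np.float32), **cond}
        return dit.run(None, feeds)[0]

    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        k1 = velocity(x, t)                              # slope at the start of the step
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)   # slope at the midpoint
        x = x + dt * k2                                   # midpoint update
    return x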

Exporting Models

Export scripts are in the onnx_export/ directory.

Export All Models

python -m onnx_export.export_all --output_dir ./onnx_models

Export Individual Components

# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda

# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small

# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small

# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models

# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify

# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify

FP16 Quantization (for large models)

For the large model (sam-audio-large), use --fp16 --device cuda during DiT export to reduce size by 50%:

# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
    --output-dir ./onnx_models_large_fp16 \
    --model-id facebook/sam-audio-large \
    --fp16 \
    --device cuda

The inference script automatically detects FP16 models and handles input conversion.
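
A minimal sketch of that detection using ONNX Runtime input metadata (the model path is just the FP16 export location from the command above):

import numpy as np
import onnxruntime as ort

# Inspect the DiT session's declared input types and cast float32 feeds to float16
# wherever the exported graph expects half precision.
session = ort.InferenceSession("onnx_models_large_fp16/dit_single_step.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

def cast_feeds(session: ort.InferenceSession, feeds: dict) -> dict:
    fp16_inputs = {i.name for i in session.get_inputs() if i.type == "tensor(float16)"}
    return {name: (arr.astype(np.float16) if name in fp16_inputs else arr)
            for name, arr in feeds.items()}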

Export Scripts Reference

| Script | Description |
|---|---|
| export_all.py | Export all components at once |
| export_dit.py | DiT transformer with FP16 support |
| export_dacvae.py | DACVAE encoder and decoder |
| export_t5.py | T5 text encoder |
| export_vision.py | Vision encoder (CLIP-based) |
| export_peaframe.py | PEAFrame span predictor + tokenizer |
| export_clap.py | CLAP audio + text encoders for reranking |
| standalone_config.py | Config classes for standalone export |

License

SAM-Audio is released under the CC-BY-NC 4.0 license. See the original repository for full terms.

Acknowledgments

Original model by Meta AI Research.
