SAM-Audio ONNX (Small)
ONNX-converted models for SAM-Audio (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling for audio source separation.
Model Files
| File | Description | Size |
|---|---|---|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |
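Each file in the table corresponds to one onnxruntime session. A minimal sketch of loading a component and inspecting its declared inputs and outputs, assuming the files sit in a local `./onnx_models` directory (an assumed layout, not a requirement):

```python
# Minimal sketch: load one exported component with onnxruntime and inspect its
# declared inputs/outputs. Point the path at wherever you downloaded the files.
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # falls back to CPU without a GPU
session = ort.InferenceSession("./onnx_models/dacvae_encoder.onnx", providers=providers)

for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```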
Installation
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
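To confirm that the GPU build is actually picked up, you can list the available execution providers:

```python
# Quick check of which onnxruntime execution providers are available.
# "CUDAExecutionProvider" should appear if onnxruntime-gpu installed correctly.
import onnxruntime as ort

print(ort.get_device())               # "GPU" or "CPU"
print(ort.get_available_providers())  # e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"]
```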
Usage Examples
Audio-Only Separation
python onnx_inference.py \
--audio input.wav \
--text "a person speaking" \
--output separated.wav
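The pipeline operates on 48kHz audio (see Model Specifications below). If you want to inspect or prepare inputs yourself, here is a rough sketch of the load-and-resample step with torchaudio; the exact preprocessing inside `onnx_inference.py` may differ, and the final tensor shape is an assumption:

```python
# Sketch: load an arbitrary file and bring it to 48 kHz mono, the rate the
# DACVAE encoder expects. Illustrative only, not the inference script's code.
import torchaudio

TARGET_SR = 48_000

waveform, sr = torchaudio.load("input.wav")          # (channels, samples)
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono
audio = waveform.unsqueeze(0).numpy()                # (1, 1, samples) - assumed ONNX input layout
```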
Video-Guided Separation
python onnx_inference.py \
--video input.mp4 \
--text "the sound of typing" \
--output separated.wav
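The vision encoder expects 336×336 frames (see Model Specifications). A hedged sketch of frame preparation with torchvision follows; the normalization constants are the standard CLIP values and are an assumption here, not taken from the export code:

```python
# Sketch: sample frames from a video and resize/normalize them for the vision
# encoder. The mean/std are standard CLIP values (assumption).
import torch
from torchvision import transforms
from torchvision.io import read_video

frames, _, _ = read_video("input.mp4", pts_unit="sec", output_format="TCHW")  # (T, C, H, W), uint8
preprocess = transforms.Compose([
    transforms.Resize(336, antialias=True),
    transforms.CenterCrop(336),
    transforms.ConvertImageDtype(torch.float32),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
pixel_values = torch.stack([preprocess(f) for f in frames[::8]])  # subsample every 8th frame
```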
Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
python onnx_inference.py \
--audio input.wav \
--text "horn" \
--predict-spans \
--output separated.wav
This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
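Conceptually, PEAFrame scores each audio frame against the text, and frames above the threshold (0.3 by default) are merged into time spans. An illustrative sketch of that post-processing step, using made-up scores and a hypothetical frame duration:

```python
# Illustrative only: turn per-frame similarity scores into (start, end) spans
# by thresholding and merging contiguous frames. The scores and frame duration
# are made up; the real values come from peaframe.onnx.
import numpy as np

def scores_to_spans(scores, frame_sec, threshold=0.3):
    spans, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i * frame_sec
        elif s < threshold and start is not None:
            spans.append((start, i * frame_sec))
            start = None
    if start is not None:
        spans.append((start, len(scores) * frame_sec))
    return spans

scores = np.array([0.1, 0.2, 0.5, 0.7, 0.6, 0.2, 0.1, 0.4, 0.5, 0.1])
print(scores_to_spans(scores, frame_sec=0.5))  # [(1.0, 2.5), (3.5, 4.5)]
```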
Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
# Focus on specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--anchor + 4.5 7.0 \
--anchor + 12.0 15.5 \
--output separated.wav
# Ignore specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "background music" \
--anchor - 0.0 3.0 \
--output separated.wav
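For intuition: at 48kHz with a hop length of 1536 samples (see Model Specifications), one latent frame covers 32 ms, so an anchor span maps to a range of latent frames. The helper below is hypothetical and only illustrates that arithmetic; how `onnx_inference.py` consumes anchors internally may differ:

```python
# Hypothetical helper: map (sign, start_sec, end_sec) anchors to a +1/-1/0 mask
# over latent frames, using the 48 kHz sample rate and 1536-sample hop from the
# model specs. Not the inference script's actual representation.
import numpy as np

SAMPLE_RATE, HOP = 48_000, 1536
FRAMES_PER_SEC = SAMPLE_RATE / HOP  # 31.25 latent frames per second

def anchors_to_mask(anchors, total_sec):
    n_frames = int(round(total_sec * FRAMES_PER_SEC))
    mask = np.zeros(n_frames, dtype=np.int8)
    for sign, start, end in anchors:
        lo, hi = int(start * FRAMES_PER_SEC), int(end * FRAMES_PER_SEC)
        mask[lo:hi] = 1 if sign == "+" else -1
    return mask

mask = anchors_to_mask([("+", 4.5, 7.0), ("-", 0.0, 3.0)], total_sec=20.0)
```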
CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--rerank \
--num-candidates 4 \
--output separated.wav
Reranking generates multiple separation candidates with different random seeds and uses CLAP to score audio-text similarity, selecting the candidate that best matches the text description. This can improve quality at the cost of ~4x inference time.
Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
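The selection step itself reduces to cosine similarity in CLAP's 512-dimensional embedding space. A minimal sketch, with random placeholder embeddings standing in for the outputs of `clap_audio_encoder.onnx` and `clap_text_encoder.onnx`:

```python
# Sketch of the reranking decision: score each candidate's CLAP audio embedding
# against the CLAP text embedding and keep the best match. Embeddings here are
# random placeholders, not real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
candidate_embs = rng.normal(size=(4, 512))  # one 512-dim embedding per candidate
text_emb = rng.normal(size=(512,))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(e, text_emb) for e in candidate_embs]
best = int(np.argmax(scores))
print(f"best candidate: {best} (score {scores[best]:.3f})")
```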
Visual Prompting with SAM3 Mask
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
--video input.mp4 \
--mask object_mask.mp4 \
--text "" \
--output isolated.wav \
--output-video visualization.mp4
Using a Custom Model Directory
python onnx_inference.py \
--video input.mp4 \
--text "woman speaking" \
--model-dir ./my_onnx_models \
--output separated.wav
Model Specifications
- Audio Sample Rate: 48kHz
- Audio Hop Length: 1536 samples
- Vision Input Size: 336×336 pixels
- Text Encoder: T5-base (768-dim)
- Vision Encoder: PE-Core-L14-336 (1024-dim)
- ODE Solver: Midpoint method (configurable steps, default 16)
- PEAFrame: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- CLAP: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4
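Since `dit_single_step.onnx` exposes only a single denoising step, the ODE integration loop lives in the inference script. A schematic midpoint (RK2) loop over the default 16 steps, with a dummy velocity function standing in for the DiT call (the real call also takes text/vision conditioning and is more involved):

```python
# Schematic midpoint (RK2) integration over 16 steps, the default solver above.
# `velocity` is a stand-in for running dit_single_step.onnx, NOT the DiT model.
import numpy as np

def velocity(x, t):
    return -x  # placeholder dynamics

def midpoint_solve(x, num_steps=16):
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        k = velocity(x, t0)                                       # slope at step start
        x = x + dt * velocity(x + 0.5 * dt * k, t0 + 0.5 * dt)    # midpoint update
    return x

latent = midpoint_solve(np.random.randn(1, 64, 128).astype(np.float32))
```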
Exporting Models
Export scripts are in the `onnx_export/` directory.
Export All Models
python -m onnx_export.export_all --output_dir ./onnx_models
Export Individual Components
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda
# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small
# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small
# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models
# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify
# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
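After exporting, a structural sanity check with the `onnx` package (an extra dependency, not listed in the installation command above) can catch broken graphs before you run inference:

```python
# Optional sanity check on an exported model. Requires `pip install onnx`.
import onnx

onnx.checker.check_model("./onnx_models/t5_encoder.onnx")  # raises if the graph is malformed
print("t5_encoder.onnx passed the ONNX checker")
```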
FP16 Quantization (for large models)
For the large model (`facebook/sam-audio-large`), pass `--fp16 --device cuda` during DiT export to reduce size by 50%:
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
--output-dir ./onnx_models_large_fp16 \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
The inference script automatically detects FP16 models and handles input conversion.
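If you drive the models directly, the same detection is easy to replicate: read the session's declared input type and cast your feeds to match. A sketch, with an illustrative input shape:

```python
# Sketch of FP16 handling: read the DiT session's declared input dtype and cast
# feeds to match. The (1, 64, 128) shape is illustrative, not the real layout.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "./onnx_models_large_fp16/dit_single_step.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

is_fp16 = session.get_inputs()[0].type == "tensor(float16)"
dtype = np.float16 if is_fp16 else np.float32

latents = np.random.randn(1, 64, 128).astype(dtype)  # cast feeds to the model's dtype
```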
Export Scripts Reference
| Script | Description |
|---|---|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |
License
SAM-Audio is released under the CC-BY-NC 4.0 license. See the original repository for full terms.
Acknowledgments
Original model by Meta AI Research.