SAM-Audio ONNX (Small)
ONNX-converted models for SAM-Audio (`facebook/sam-audio-small`), Meta's Semantic Audio Modeling for audio source separation.
Model Files
| File | Description | Size |
|---|---|---|
| `dacvae_encoder.onnx` | Audio encoder (48kHz → latent) | ~110 MB |
| `dacvae_decoder.onnx` | Audio decoder (latent → 48kHz) | ~320 MB |
| `t5_encoder.onnx` | Text encoder (T5-base) | ~440 MB |
| `dit_single_step.onnx` | DiT denoiser (single ODE step) | ~2 GB |
| `vision_encoder.onnx` | Vision encoder (CLIP-based) | ~1.2 GB |
| `peaframe.onnx` | PEAFrame span predictor (audio-text similarity) | ~5.8 GB |
| `tokenizer/` | SentencePiece tokenizer files (T5) | - |
| `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
| `peaframe_config.json` | PEAFrame scaling parameters | - |
| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
| `clap_config.json` | CLAP audio preprocessing parameters | - |
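Each file in the table corresponds to one onnxruntime session. A minimal sketch of loading a component and inspecting its declared inputs and outputs, assuming the files sit in a local `./onnx_models` directory (an assumed layout, not a requirement):

```python
# Minimal sketch: load one exported component with onnxruntime and inspect its
# declared inputs/outputs. Point the path at wherever you downloaded the files.
import onnxruntime as ort

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]  # falls back to CPU without a GPU
session = ort.InferenceSession("./onnx_models/dacvae_encoder.onnx", providers=providers)

for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```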
Installation
pip install onnxruntime sentencepiece torchaudio torchvision torchcodec soundfile transformers
# For CUDA support:
pip install onnxruntime-gpu
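To confirm that the GPU build is actually picked up, you can list the available execution providers:

```python
# Quick check of which onnxruntime execution providers are available.
# "CUDAExecutionProvider" should appear if onnxruntime-gpu installed correctly.
import onnxruntime as ort

print(ort.get_device())               # "GPU" or "CPU"
print(ort.get_available_providers())  # e.g. ["CUDAExecutionProvider", "CPUExecutionProvider"]
```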
Usage Examples
Audio-Only Separation
python onnx_inference.py \
--audio input.wav \
--text "a person speaking" \
--output separated.wav
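The pipeline operates on 48kHz audio (see Model Specifications below). If you want to inspect or prepare inputs yourself, here is a rough sketch of the load-and-resample step with torchaudio; the exact preprocessing inside `onnx_inference.py` may differ, and the final tensor shape is an assumption:

```python
# Sketch: load an arbitrary file and bring it to 48 kHz mono, the rate the
# DACVAE encoder expects. Illustrative only, not the inference script's code.
import torchaudio

TARGET_SR = 48_000

waveform, sr = torchaudio.load("input.wav")          # (channels, samples)
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
waveform = waveform.mean(dim=0, keepdim=True)        # downmix to mono
audio = waveform.unsqueeze(0).numpy()                # (1, 1, samples) - assumed ONNX input layout
```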
Video-Guided Separation
python onnx_inference.py \
--video input.mp4 \
--text "the sound of typing" \
--output separated.wav
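The vision encoder expects 336×336 frames (see Model Specifications). A hedged sketch of frame preparation with torchvision follows; the normalization constants are the standard CLIP values and are an assumption here, not taken from the export code:

```python
# Sketch: sample frames from a video and resize/normalize them for the vision
# encoder. The mean/std are standard CLIP values (assumption).
import torch
from torchvision import transforms
from torchvision.io import read_video

frames, _, _ = read_video("input.mp4", pts_unit="sec", output_format="TCHW")  # (T, C, H, W), uint8
preprocess = transforms.Compose([
    transforms.Resize(336, antialias=True),
    transforms.CenterCrop(336),
    transforms.ConvertImageDtype(torch.float32),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
pixel_values = torch.stack([preprocess(f) for f in frames[::8]])  # subsample every 8th frame
```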
Automatic Span Prediction
Use PEAFrame to automatically detect time spans matching your text description:
python onnx_inference.py \
--audio input.wav \
--text "horn" \
--predict-spans \
--output separated.wav
This is ideal for long audio where you want to isolate sounds that appear intermittently. The model will automatically detect when the target sound occurs and focus on those segments.
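Conceptually, PEAFrame scores each audio frame against the text, and frames above the threshold (0.3 by default) are merged into time spans. An illustrative sketch of that post-processing step, using made-up scores and a hypothetical frame duration:

```python
# Illustrative only: turn per-frame similarity scores into (start, end) spans
# by thresholding and merging contiguous frames. The scores and frame duration
# are made up; the real values come from peaframe.onnx.
import numpy as np

def scores_to_spans(scores, frame_sec, threshold=0.3):
    spans, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i * frame_sec
        elif s < threshold and start is not None:
            spans.append((start, i * frame_sec))
            start = None
    if start is not None:
        spans.append((start, len(scores) * frame_sec))
    return spans

scores = np.array([0.1, 0.2, 0.5, 0.7, 0.6, 0.2, 0.1, 0.4, 0.5, 0.1])
print(scores_to_spans(scores, frame_sec=0.5))  # [(1.0, 2.5), (3.5, 4.5)]
```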
Manual Anchors
Specify exact time spans to focus on (positive anchors) or ignore (negative anchors):
# Focus on specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--anchor + 4.5 7.0 \
--anchor + 12.0 15.5 \
--output separated.wav
# Ignore specific time ranges
python onnx_inference.py \
--audio input.wav \
--text "background music" \
--anchor - 0.0 3.0 \
--output separated.wav
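For intuition: at 48kHz with a hop length of 1536 samples (see Model Specifications), one latent frame covers 32 ms, so an anchor span maps to a range of latent frames. The helper below is hypothetical and only illustrates that arithmetic; how `onnx_inference.py` consumes anchors internally may differ:

```python
# Hypothetical helper: map (sign, start_sec, end_sec) anchors to a +1/-1/0 mask
# over latent frames, using the 48 kHz sample rate and 1536-sample hop from the
# model specs. Not the inference script's actual representation.
import numpy as np

SAMPLE_RATE, HOP = 48_000, 1536
FRAMES_PER_SEC = SAMPLE_RATE / HOP  # 31.25 latent frames per second

def anchors_to_mask(anchors, total_sec):
    n_frames = int(round(total_sec * FRAMES_PER_SEC))
    mask = np.zeros(n_frames, dtype=np.int8)
    for sign, start, end in anchors:
        lo, hi = int(start * FRAMES_PER_SEC), int(end * FRAMES_PER_SEC)
        mask[lo:hi] = 1 if sign == "+" else -1
    return mask

mask = anchors_to_mask([("+", 4.5, 7.0), ("-", 0.0, 3.0)], total_sec=20.0)
```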
CLAP Reranking
Generate multiple candidates and select the best using CLAP audio-text similarity:
python onnx_inference.py \
--audio input.wav \
--text "person speaking" \
--rerank \
--num-candidates 4 \
--output separated.wav
Reranking generates multiple separation candidates with different random seeds and uses CLAP to score audio-text similarity, selecting the candidate that best matches the text description. This can improve quality at the cost of ~4x inference time.
Options:
- `--rerank` - Enable reranking mode
- `--num-candidates N` - Number of candidates (default: 4)
- `--rerank-seed SEED` - Random seed for reproducibility
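The selection step itself reduces to cosine similarity in CLAP's 512-dimensional embedding space. A minimal sketch, with random placeholder embeddings standing in for the outputs of `clap_audio_encoder.onnx` and `clap_text_encoder.onnx`:

```python
# Sketch of the reranking decision: score each candidate's CLAP audio embedding
# against the CLAP text embedding and keep the best match. Embeddings here are
# random placeholders, not real encoder outputs.
import numpy as np

rng = np.random.default_rng(0)
candidate_embs = rng.normal(size=(4, 512))  # one 512-dim embedding per candidate
text_emb = rng.normal(size=(512,))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(e, text_emb) for e in candidate_embs]
best = int(np.argmax(scores))
print(f"best candidate: {best} (score {scores[best]:.3f})")
```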
Visual Prompting with SAM3 Mask
# First generate a mask with SAM3 (see generate_sam3_mask.py)
python onnx_inference.py \
--video input.mp4 \
--mask object_mask.mp4 \
--text "" \
--output isolated.wav \
--output-video visualization.mp4
Using a Custom Model Directory
python onnx_inference.py \
--video input.mp4 \
--text "woman speaking" \
--model-dir ./my_onnx_models \
--output separated.wav
Model Specifications
- Audio Sample Rate: 48kHz
- Audio Hop Length: 1536 samples
- Vision Input Size: 336×336 pixels
- Text Encoder: T5-base (768-dim)
- Vision Encoder: PE-Core-L14-336 (1024-dim)
- ODE Solver: Midpoint method (configurable steps, default 16)
- PEAFrame: Audio-text similarity model for span detection
  - Uses ModernBERT tokenizer
  - Processes audio in ~3.3s chunks with 50% overlap
  - Default threshold: 0.3
- CLAP: Audio-text similarity model for candidate reranking
  - Audio encoder: HTSAT-tiny
  - Text encoder: RoBERTa-base
  - Embedding dimension: 512
  - Default candidates: 4
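Since `dit_single_step.onnx` exposes only a single denoising step, the ODE integration loop lives in the inference script. A schematic midpoint (RK2) loop over the default 16 steps, with a dummy velocity function standing in for the DiT call (the real call also takes text/vision conditioning and is more involved):

```python
# Schematic midpoint (RK2) integration over 16 steps, the default solver above.
# `velocity` is a stand-in for running dit_single_step.onnx, NOT the DiT model.
import numpy as np

def velocity(x, t):
    return -x  # placeholder dynamics

def midpoint_solve(x, num_steps=16):
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        dt = t1 - t0
        k = velocity(x, t0)                                       # slope at step start
        x = x + dt * velocity(x + 0.5 * dt * k, t0 + 0.5 * dt)    # midpoint update
    return x

latent = midpoint_solve(np.random.randn(1, 64, 128).astype(np.float32))
```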
Exporting Models
Export scripts are in the `onnx_export/` directory.
Export All Models
python -m onnx_export.export_all --output_dir ./onnx_models
Export Individual Components
# DiT Transformer (supports FP16 for 50% size reduction)
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-small
python -m onnx_export.export_dit --output-dir ./onnx_models --model-id facebook/sam-audio-large --fp16 --device cuda
# DACVAE (encoder + decoder)
python -m onnx_export.export_dacvae --output-dir ./onnx_models --model-id facebook/sam-audio-small
# T5 Text Encoder
python -m onnx_export.export_t5 --output-dir ./onnx_models --model-id facebook/sam-audio-small
# Vision Encoder
python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./onnx_models
# PEAFrame Span Predictor
python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify
# CLAP Reranking (audio + text encoders)
python -m onnx_export.export_clap --output-dir ./onnx_models --verify
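After exporting, a structural sanity check with the `onnx` package (an extra dependency, not listed in the installation command above) can catch broken graphs before you run inference:

```python
# Optional sanity check on an exported model. Requires `pip install onnx`.
import onnx

onnx.checker.check_model("./onnx_models/t5_encoder.onnx")  # raises if the graph is malformed
print("t5_encoder.onnx passed the ONNX checker")
```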
FP16 Quantization (for large models)
For the large model (`facebook/sam-audio-large`), pass `--fp16 --device cuda` during DiT export to reduce size by 50%:
# Export DiT in FP16 (11.7GB → 5.9GB)
python -m onnx_export.export_dit \
--output-dir ./onnx_models_large_fp16 \
--model-id facebook/sam-audio-large \
--fp16 \
--device cuda
The inference script automatically detects FP16 models and handles input conversion.
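If you drive the models directly, the same detection is easy to replicate: read the session's declared input type and cast your feeds to match. A sketch, with an illustrative input shape:

```python
# Sketch of FP16 handling: read the DiT session's declared input dtype and cast
# feeds to match. The (1, 64, 128) shape is illustrative, not the real layout.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "./onnx_models_large_fp16/dit_single_step.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

is_fp16 = session.get_inputs()[0].type == "tensor(float16)"
dtype = np.float16 if is_fp16 else np.float32

latents = np.random.randn(1, 64, 128).astype(dtype)  # cast feeds to the model's dtype
```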
Export Scripts Reference
| Script | Description |
|---|---|
| `export_all.py` | Export all components at once |
| `export_dit.py` | DiT transformer with FP16 support |
| `export_dacvae.py` | DACVAE encoder and decoder |
| `export_t5.py` | T5 text encoder |
| `export_vision.py` | Vision encoder (CLIP-based) |
| `export_peaframe.py` | PEAFrame span predictor + tokenizer |
| `export_clap.py` | CLAP audio + text encoders for reranking |
| `standalone_config.py` | Config classes for standalone export |
License
SAM-Audio is released under the CC-BY-NC 4.0 license. See the original repository for full terms.
Acknowledgments
Original model by Meta AI Research.