sam3.1-bf16

This model was converted to MLX format from facebook/sam3.1 using mlx-vlm version 0.4.3.

Open-vocabulary object detection, instance segmentation, and video tracking with Object Multiplex on Apple Silicon (~873M parameters).

SAM 3.1 extends SAM 3 with:

MultiplexMaskDecoder: processes 16 objects simultaneously (2.4-4x faster tracking)
TriViTDetNeck: 3 parallel FPN heads (detection, interactive, propagation)
DecoupledMemoryAttention: image cross-attention with RoPE
Improved detection accuracy (top score 0.90 vs 0.87 for the "a cat" prompt; see Benchmarks)
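
To make the multiplexing point concrete, the sketch below shows how a set of tracked objects might be split into decoder passes when up to 16 objects share one pass. The group size of 16 comes from the list above; the helper name `multiplex_groups` and the simple consecutive chunking are assumptions for illustration, not the actual MultiplexMaskDecoder implementation.

```python
GROUP_SIZE = 16  # objects handled per pass, taken from the feature list above


def multiplex_groups(object_ids, group_size=GROUP_SIZE):
    """Split object ids into consecutive groups of at most `group_size`.

    Hypothetical helper: it only illustrates the batching idea.
    """
    return [
        object_ids[start:start + group_size]
        for start in range(0, len(object_ids), group_size)
    ]


# 20 tracked objects fit into two passes instead of 20 per-object passes.
groups = multiplex_groups(list(range(20)))
print(len(groups), [len(g) for g in groups])
```

Under this assumption, any scene with 16 or fewer objects needs a single decoder pass per frame.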

Quick Start

pip install "mlx-vlm>=0.4.3"

from PIL import Image
from mlx_vlm.utils import load_model, get_model_path
from mlx_vlm.models.sam3.generate import Sam3Predictor
from mlx_vlm.models.sam3_1.processing_sam3_1 import Sam31Processor

model_path = get_model_path("mlx-community/sam3.1-bf16")
model = load_model(model_path)
processor = Sam31Processor.from_pretrained(str(model_path))
predictor = Sam3Predictor(model, processor, score_threshold=0.3)

Object Detection

image = Image.open("photo.jpg")
result = predictor.predict(image, text_prompt="a dog")

for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
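
Downstream code often wants only the confident detections, sometimes in a different box layout. The following is a minimal sketch, assuming scores and boxes can be read as plain sequences the way the loop above reads them. `xyxy_to_xywh` and `filter_detections` are hypothetical helpers, not part of mlx-vlm.

```python
def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) corners to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


def filter_detections(scores, boxes, min_score=0.5):
    """Keep (score, xywh) pairs at or above `min_score`, best first."""
    kept = [
        (float(score), xyxy_to_xywh(box))
        for score, box in zip(scores, boxes)
        if score >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Stand-in values; in practice pass result.scores and result.boxes.
scores = [0.42, 0.91, 0.77]
boxes = [(0, 0, 10, 10), (100, 50, 400, 350), (20, 30, 60, 90)]
for score, (x, y, w, h) in filter_detections(scores, boxes):
    print(f"[{score:.2f}] x={x} y={y} w={w} h={h}")
```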

Instance Segmentation

result = predictor.predict(image, text_prompt="a person")

# result.boxes   -> (N, 4) xyxy bounding boxes
# result.masks   -> (N, H, W) binary segmentation masks
# result.scores  -> (N,) confidence scores

import numpy as np
overlay = np.array(image.convert("RGB")).copy()  # force 3 channels so the RGB blend below is valid
W, H = image.size
for i in range(len(result.scores)):
    mask = result.masks[i]
    if mask.shape != (H, W):
        mask = np.array(Image.fromarray(mask.astype(np.float32)).resize((W, H)))
    binary = mask > 0
    overlay[binary] = (overlay[binary] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)
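
The loop above paints every instance the same red with a fixed 50/50 blend. As a sketch of the arithmetic involved, the helpers below pick a distinct color per instance and blend a single pixel. `instance_color` and `blend` are hypothetical names introduced here, and the evenly spaced hue palette is just one possible choice.

```python
import colorsys


def instance_color(index, total):
    """Return an RGB color (0-255) with hue evenly spaced over `total` instances."""
    hue = index / max(total, 1)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


def blend(pixel, color, alpha=0.5):
    """Alpha-blend one RGB pixel toward `color`.

    int() truncates, matching what astype(np.uint8) does to in-range floats.
    """
    return tuple(int(p * (1 - alpha) + c * alpha) for p, c in zip(pixel, color))


palette = [instance_color(i, 3) for i in range(3)]
print(palette)
print(blend((100, 200, 50), palette[0]))
```

In the NumPy loop, the same idea would replace the hard-coded `[255, 0, 0]` with `instance_color(i, len(result.scores))`.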

Multi-Prompt Detection

from mlx_vlm.models.sam3_1.generate import predict_multi

result = predict_multi(predictor, image, ["a cat", "a remote control"])
for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] {result.labels[i]} box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
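
With several prompts in one call, it is often handy to regroup the flat result by label. A minimal sketch, assuming `result.labels`, `result.scores`, and `result.boxes` are parallel sequences as the loop above suggests. `group_by_label` is a hypothetical helper, not part of mlx-vlm.

```python
from collections import defaultdict


def group_by_label(labels, scores, boxes):
    """Map each label to a list of (score, box) pairs, preserving input order."""
    grouped = defaultdict(list)
    for label, score, box in zip(labels, scores, boxes):
        grouped[label].append((float(score), tuple(box)))
    return dict(grouped)


# Stand-in values; in practice pass result.labels, result.scores, result.boxes.
labels = ["a cat", "a remote control", "a cat"]
scores = [0.90, 0.94, 0.86]
boxes = [(10, 20, 110, 220), (300, 40, 360, 90), (150, 30, 260, 240)]

for label, detections in group_by_label(labels, scores, boxes).items():
    print(f"{label}: {len(detections)} detection(s)")
```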

Box-Guided Detection

import numpy as np
boxes = np.array([[100, 50, 400, 350]])  # xyxy pixel coords
result = predictor.predict(image, text_prompt="a cat", boxes=boxes)
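
Box prompts are given as xyxy pixel coordinates, while many annotation tools export (x, y, width, height). The sketch below converts between the two and clamps the result to the image bounds. `xywh_to_xyxy` is a hypothetical helper, and whether the predictor clamps boxes itself is not stated here, so the clamping is a defensive assumption.

```python
def xywh_to_xyxy(box, image_size):
    """Convert (x, y, w, h) to (x1, y1, x2, y2), clamped to the image bounds.

    `image_size` is (width, height), the same order as PIL's Image.size.
    """
    x, y, w, h = box
    width, height = image_size
    x1 = min(max(x, 0), width)
    y1 = min(max(y, 0), height)
    x2 = min(max(x + w, 0), width)
    y2 = min(max(y + h, 0), height)
    return (x1, y1, x2, y2)


print(xywh_to_xyxy((100, 50, 300, 300), (640, 480)))
print(xywh_to_xyxy((-10, 20, 700, 100), (640, 480)))
```

The converted tuple can then be wrapped as `np.array([xywh_to_xyxy(box, image.size)])` and passed as `boxes`.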

CLI

# Object detection
python -m mlx_vlm.models.sam3_1.generate --task detect --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Instance segmentation (segment is the default task, so --task can be omitted)
python -m mlx_vlm.models.sam3_1.generate --image photo.jpg --prompt "a cat" --model mlx-community/sam3.1-bf16

# Video tracking
python -m mlx_vlm.models.sam3_1.generate --task track --video input.mp4 --prompt "a car" --model mlx-community/sam3.1-bf16

# Real-time webcam (optimized: backbone caching + tracker propagation)
python -m mlx_vlm.models.sam3_1.generate --task realtime --prompt "a person" --model mlx-community/sam3.1-bf16 --resolution 224

| Flag | Default | Description |
|---|---|---|
| `--task` | `segment` | `detect`, `segment`, `track`, `realtime` |
| `--prompt` | (required) | Text prompt(s), supports multiple |
| `--resolution` | `1008` | Input resolution (224 for faster realtime) |
| `--detect-every` | `15` | Re-run full detection every N frames |
| `--backbone-every` | `30` | Re-run ViT backbone every N frames |
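
To illustrate how the two "every N frames" flags interact, the sketch below lists which frames would trigger each stage under a simple modulo schedule. The modulo rule and the `frame_plan` helper are assumptions made for illustration; the actual scheduling inside the realtime loop may differ.

```python
def frame_plan(frame_idx, detect_every=15, backbone_every=30):
    """Return which stages run on a frame under an assumed modulo schedule."""
    return {
        "backbone": frame_idx % backbone_every == 0,
        "detect": frame_idx % detect_every == 0,
    }


plans = [frame_plan(i) for i in range(60)]
detect_frames = [i for i, plan in enumerate(plans) if plan["detect"]]
backbone_frames = [i for i, plan in enumerate(plans) if plan["backbone"]]

print("full detection on frames:", detect_frames)
print("backbone refresh on frames:", backbone_frames)
```

With the defaults, every other frame in this model would rely on cached features and tracker propagation, which is where the realtime speedup would come from.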

Benchmarks (M3 Max, bf16)

Detection Accuracy

| Prompt | SAM 3 | SAM 3.1 |
|---|---|---|
| "a cat" (2 cats) | 0.87, 0.82 | 0.90, 0.86 |
| "a remote control" | 0.95, 0.94 | 0.94, 0.94 |

Tracker Multiplex Speed

| Objects | SAM 3 | SAM 3.1 | Speedup |
|---|---|---|---|
| 3 | 547 ms/frame | 227 ms/frame | 2.4x |
| 4 | 608 ms/frame | 203 ms/frame | 3.0x |
| 5 | 766 ms/frame | 190 ms/frame | 4.0x |
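
The speedup column is the ratio of the two per-frame times. As a quick check, this sketch recomputes it from the timings in the table; `speedup` is a helper introduced here for illustration.

```python
# Per-frame tracking time in milliseconds: objects -> (SAM 3, SAM 3.1)
timings_ms = {3: (547, 227), 4: (608, 203), 5: (766, 190)}


def speedup(baseline_ms, new_ms):
    """Return how many times faster `new_ms` is than `baseline_ms`."""
    return baseline_ms / new_ms


for objects, (sam3_ms, sam31_ms) in timings_ms.items():
    print(f"{objects} objects: {speedup(sam3_ms, sam31_ms):.1f}x")
```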

Optimized Realtime (224px)

| Metric | Value |
|---|---|
| Cached frame | 38 ms (26 FPS) |
| Sustained average | ~40 ms (25 FPS) |
| Baseline (no optimization) | ~212 ms (5 FPS) |
| Total speedup | 4.6x |
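
The FPS figures follow from the per-frame latency as 1000 / milliseconds, rounded to the nearest whole frame. A minimal sketch of that conversion, with `fps_from_ms` as a helper name introduced here:

```python
def fps_from_ms(frame_ms):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / frame_ms


for label, frame_ms in [("cached frame", 38), ("sustained average", 40), ("baseline", 212)]:
    print(f"{label}: {frame_ms} ms -> {fps_from_ms(frame_ms):.1f} FPS")
```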

Original Model

facebook/sam3.1 · Code

License

The original SAM 3.1 model weights are released by Meta under the SAM License, a custom permissive license for commercial and research use.
