EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3

Chengxi Simon Zeng1,∗, Yuxuan Jiang1, Ge Gao1, Shuai Wang2, Duolikun Danier3, Bin Zhu4, Stevan Rudinac2, David Bull1, Fan Aaron Zhang1,†

1Visual Information Lab, University of Bristol; 2University of Amsterdam; 3School of Informatics, University of Edinburgh; 4Singapore Management University

∗Primary Contributor †Corresponding Author

📄 Paper

Updates

  • [2026/02/18] SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. Paper available on arXiv.
  • [2026/01/11] Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
  • [2025/12/08] Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L), distilled on 1% of the Recap-DataComp-1B dataset.
  • [2025/12/02] Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT), distilled (unsupervised) on 1% of the SA-1B dataset.
  • [2025/11/25] Teaser model released; see the Inference section below. More models are baking in the oven 🔥.
  • [2025/10/18] Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.

SAM3 (Segment Anything Model 3) has introduced powerful Promptable Concept Segmentation (PCS) capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.

EfficientSAM3 addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.

EfficientSAM3 Architecture


Supported Models and Architecture

| Component | Model/Backbone | Purpose |
|---|---|---|
| Teacher Models | SAM (Segment Anything Model) | Foundation for image-level encoder distillation |
| | SAM2 | Temporal memory and video tracking distillation |
| | SAM3 | Promptable Concept Segmentation (PCS) capabilities |
| Datasets | SA-1B | Image segmentation dataset |
| | SA-V | Video object segmentation dataset |
| | SA-Co/Gold | Promptable concept segmentation benchmark |
| | Recap-DataComp-1B | Large-scale image-text dataset for text encoder distillation |
| Student Backbones (Image) | RepViT (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
| | TinyViT (5M, 11M, 21M) | Balanced efficiency and performance |
| | EfficientViT (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| Student Backbones (Text) | MobileCLIP S0 | Lightweight text encoder (42.57M params) |
| | MobileCLIP S1 | Balanced text encoder (63.56M params) |
| | MobileCLIP2 L | Larger text encoder (123.6M params) |

Three-Stage Progressive Training Curriculum

EfficientSAM3 is trained with a three-stage progressive distillation curriculum:

Stage 1: Encoder Distillation (Image-Level Segmentation)

  • Distill the SAM3 image encoder into nine student backbones (three RepViT, three TinyViT, and three EfficientViT variants)
  • Distill the SAM3 text encoder into three student text encoders (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L)
  • Use SA-1B dataset with Prompt-in-the-Loop Distillation for image encoder distillation
  • Use Recap-DataComp-1B dataset for text encoder distillation
  • Align student backbone features with teacher encoder outputs.
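
A minimal sketch of the Stage 1 feature-alignment objective is shown below, assuming the teacher and student encoders both emit spatial feature maps; the projection layer, channel dimensions, and plain MSE loss are illustrative placeholders rather than the exact training recipe (which also uses Prompt-in-the-Loop Distillation):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical channel sizes: the student backbone emits (B, 256, H, W),
# the teacher SAM3 encoder emits (B, 1024, H', W').
proj = nn.Conv2d(256, 1024, kernel_size=1)  # align student channels to the teacher's

def stage1_distill_loss(student_feats, teacher_feats):
    # Project, match spatial resolution, then align features with an MSE objective.
    s = proj(student_feats)
    s = F.interpolate(s, size=teacher_feats.shape[-2:], mode="bilinear", align_corners=False)
    return F.mse_loss(s, teacher_feats.detach())

# Random tensors stand in for real encoder outputs:
loss = stage1_distill_loss(torch.randn(2, 256, 64, 64), torch.randn(2, 1024, 64, 64))
loss.backward()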

Stage 2: Temporal Memory Distillation (Video Tracking)

  • Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
  • Distill memory-conditioned mask predictions using SA-V dataset
  • Train the Perceiver module to compress and retrieve spatiotemporal features efficiently
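
The sketch below illustrates the Perceiver-style compression idea: a small set of learned latent queries cross-attends to the dense per-frame memory features, so only the compact latents need to be stored in the memory bank. The module name, dimensions, and number of latents are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class PerceiverMemory(nn.Module):
    """Compress dense spatial memory features into a fixed number of latent tokens."""
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):
        # memory_tokens: (B, N, dim) flattened spatial features from past frames
        q = self.latents.unsqueeze(0).expand(memory_tokens.shape[0], -1, -1)
        compressed, _ = self.cross_attn(q, memory_tokens, memory_tokens)
        return self.norm(compressed)  # (B, num_latents, dim)

# Compress 64x64 spatial tokens from one frame down to 64 memory tokens.
mem = PerceiverMemory()
compact_memory = mem(torch.randn(1, 64 * 64, 256))  # -> (1, 64, 256)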

Stage 3: End-to-End Fine-Tuning (Concept Segmentation)

  • Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
  • Joint optimization of distilled encoder + compressed memory + mask decoder
  • Preserve Promptable Concept Segmentation capabilities while maintaining efficiency

tl;dr

Stage 1: We distill the SAM3 encoder using SAM1 data.
Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data.
Stage 3: We fine-tune the complete pipeline using SAM3 data.


Installation

EfficientSAM3 purposely shares the same software contract as upstream SAM3:

  • Python ≥ 3.12
  • PyTorch 2.7.0 (CUDA 12.6 build recommended)
  • CUDA-capable GPUs with drivers that support CUDA ≥ 12.6

Follow the exact environment setup from the official SAM3 README or use the condensed steps below:

git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3

conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3

pip install --upgrade pip
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"

# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything

Inference

Download checkpoints from the Model Zoo section below. All Stage 1 image encoder weights are available via the Google Drive and Hugging Face links in those tables.
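
If you prefer to fetch checkpoints programmatically, the snippet below uses huggingface_hub; the repo ID is a placeholder and the filename is only an example, so substitute the entries linked in the Model Zoo tables:

from huggingface_hub import hf_hub_download

# Placeholder repo_id/filename: replace with the entries from the Model Zoo tables.
checkpoint_path = hf_hub_download(
    repo_id="<user>/<efficientsam3-weights-repo>",
    filename="efficient_sam3_efficientvit_s.pt",
)
print(checkpoint_path)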

Quick Start (Image Segmentation):

🔥 Teaser Image Model

EfficientViT-S (0.68M params) distilled from the SAM3 Encoder (461.84M), a 99.85% size reduction; trained on 1% of SA-1B.

from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Load model
model = build_efficientsam3_image_model(
  checkpoint_path="efficient_sam3_efficientvit_s.pt",
  backbone_type="efficientvit",
  model_name="b0",
  enable_inst_interactivity=True,
)

# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state, 
    point_coords=points, 
    point_labels=labels
)
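
To sanity-check the prediction, the snippet below overlays the highest-scoring mask on the input image; it assumes masks and scores are array-like (torch tensors or NumPy arrays), hence the defensive conversion:

import numpy as np
import matplotlib.pyplot as plt

def to_numpy(x):
    # Handles torch tensors (CPU or GPU) as well as plain arrays/lists.
    return np.asarray(x.detach().cpu() if hasattr(x, "detach") else x)

best_idx = int(np.argmax(to_numpy(scores)))
best_mask = to_numpy(masks[best_idx]).squeeze()

plt.imshow(image)
plt.imshow(best_mask, alpha=0.5, cmap="jet")
plt.axis("off")
plt.show()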

🔥 Teaser Text Prompt Model

MobileCLIP-S1 (63.56M) distilled from the SAM3 Text Encoder (353.72M); trained on 1% of Recap-DataComp-1B.

from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1"
)

# Process image and predict with text prompt
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)

🔥 SAM3-LiteText Model

Build a SAM3-LiteText model with a single call; the builder handles text encoder creation, checkpoint loading, and context length truncation internally.

from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Build SAM3-LiteText model
# Supported text_encoder_type: "MobileCLIP-S0", "MobileCLIP-S1", "MobileCLIP2-L"
# Supported text_encoder_context_length: 16, 32, or 77
model = build_sam3_image_model(
    checkpoint_path="efficient_sam3_image_encoder_mobileclip_s1_ctx32.pt",
    load_from_HF=False,
    text_encoder_type="MobileCLIP-S1",
    text_encoder_context_length=16,
    device='cuda',
)

# Run inference
processor = Sam3Processor(model, device='cuda', confidence_threshold=0.4)
state = processor.set_image(image)
state = processor.set_text_prompt("shoe", state)
masks = state["masks"]
scores = state["scores"]

For detailed examples including point/box prompts, batched inference, and more, see sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py. For text prompt inference, see sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb. For SAM3-LiteText inference examples, see sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py (image) and sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb (video).


Training and Evaluation

Training:

  • For Stage 1 encoder distillation training details, see README_stage1.md. For Stage 1 geometry fine-tuning, check the stage1_geometry_finetune branch.
  • Stage 2 and Stage 3 training details coming soon.

Evaluation:

  • To evaluate models on the COCO dataset:

    python eval/eval_coco.py --coco_root data/coco --output_dir output
    
  • To evaluate text encoder quality (token-level cosine similarity vs SAM3 teacher):

    python eval/eval_text_encoder_similarity.py \
      --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
      --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
      --device cuda
    # Optional: override teacher checkpoint
    #   --teacher-ckpt /path/to/sam3_teacher_checkpoint.pt
    

Datasets

Dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are provided in the data/ directory.


EfficientSAM3 Model Zoo & Weight Release

SAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 7.77M | HF (ft: HF) | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 22.40M | HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.07M | HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 10.55M | HF (ft: HF) | Planned | Planned |
| ES-TV-L | TinyViT-21M | 20.62M | HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.68M | HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.64M | HF (ft: HF) | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 14.98M | HF | Planned | Planned |

Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

Note (2026/01/11): The fine-tuned (ft) models use geometry-prompt fine-tuning on the same 1% subset of SA-1B; see training details in the stage1_geometry_finetune branch.

EfficientSAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | HF | Planned | Planned |

Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset and released in combination with all 9 image encoder variants. We observe some performance degradation; this is expected because the text encoders are not yet aligned with the lightweight image encoders in Stage 1. We will release the Stage 1+ fine-tuned weights in the future.

Note (2025/12/08): We have also uploaded standalone text encoder weights trained on 1% of the Recap-DataComp-1B dataset: MobileCLIP-S1 and MobileCLIP2-L. You can merge these with the Stage 1 image encoder weights to obtain a full model.

Note (2026/01/11): The fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.

SAM3-LiteText Models

SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to 88% with comparable performance. See the SAM3-LiteText paper for details.

| Model | Text Encoder | Ctx | Text Params | Weights |
|---|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 16 | 42.54M | HF |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 16 | 63.53M | HF |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 16 | 123.80M | HF |

All SAM3-LiteText models keep the original SAM3 image encoder; only the text encoder is replaced. The text encoder parameters shown are those of the distilled student standing in for the original 353.72M-parameter SAM3 text encoder, a parameter reduction of up to 88%.
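
As a quick check of the headline number with the MobileCLIP-S0 student: 1 - 42.54M / 353.72M ≈ 0.88, i.e. roughly an 88% reduction in text encoder parameters.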


Preliminary Evaluation

Stage 1 Image Model Evaluation Results (COCO val2017)

| Model Name | Backbone | Parameters | COCO mIoU | Test Time (s) |
|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | 64.80% | 407.23 |
| ES-RV-M | RepViT-M1.1 | 7.77M | 65.28% (ft 65.60%) | 413.38 |
| ES-RV-L | RepViT-M2.3 | 22.40M | 65.53% | 466.66 |
| ES-TV-S | TinyViT-5M | 5.07M | 65.51% | 430.52 |
| ES-TV-M | TinyViT-11M | 10.55M | 65.45% (ft 65.69%) | 443.45 |
| ES-TV-L | TinyViT-21M | 20.62M | 66.29% | 452.14 |
| ES-EV-S | EfficientViT-B0 | 0.68M | 61.62% | 419.57 |
| ES-EV-M | EfficientViT-B1 | 4.64M | 64.82% (ft 64.94%) | 434.45 |
| ES-EV-L | EfficientViT-B2 | 14.98M | 66.30% | 450.36 |

Note: Evaluation was run on a single NVIDIA RTX 4070 Ti GPU.

Stage 1 Text Encoder Evaluation Results (SA-Co/VEval Noun Phrases)

Metric: average token-level cosine similarity between student text features and SAM3 text encoder features.
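
A minimal sketch of this metric is given below, assuming per-token features of shape (num_tokens, dim) from the student and teacher text encoders for the same noun phrase; the real script (eval/eval_text_encoder_similarity.py) additionally handles tokenization, padding, and checkpoint loading:

import torch
import torch.nn.functional as F

def token_cosine_similarity(student_tokens, teacher_tokens):
    # student_tokens, teacher_tokens: (num_tokens, dim) features for one noun phrase.
    sims = F.cosine_similarity(student_tokens, teacher_tokens, dim=-1)  # (num_tokens,)
    return sims.mean()

# Averaged over all noun phrases in the eval set:
# avg = torch.stack([token_cosine_similarity(s, t) for s, t in phrase_pairs]).mean()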

Pretrained on 1% Recap-DataComp-1B

| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (Recap-DC1B 1% pt) | MobileCLIP-S0 | 0.864846 | 5184 noun phrases |
| ES-MC-S1 (Recap-DC1B 1% pt) | MobileCLIP-S1 | 0.854405 | 5184 noun phrases |
| ES-MC2-L (Recap-DC1B 1% pt) | MobileCLIP2-L | 0.850976 | 5184 noun phrases |

Fine-tuned on SA-Co Gold+Silver text annotations

| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (SA-Co ft) | MobileCLIP-S0 | 0.938915 | 5184 noun phrases |
| ES-MC-S1 (SA-Co ft) | MobileCLIP-S1 | 0.947152 | 5184 noun phrases |
| ES-MC2-L (SA-Co ft) | MobileCLIP2-L | 0.952901 | 5184 noun phrases |

Note: Evaluation is done with eval_text_encoder_similarity.py using data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json. Pretrained models are trained on Recap-DataComp-1B (1%), and fine-tuned models are trained on SA-Co Gold+Silver text annotations.

SAM3-LiteText Evaluation Results (SA-Co/Gold, Metric: CG_F1)

| Model | Ctx | MetaClip | SA1B | Crowd | Food | SptEq | Attr | Wiki | Avg F1 | MCC | pmF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gDino-T | - | 2.9 | 3.1 | 0.28 | 0.96 | 1.1 | 13.8 | 0.70 | 3.3 | 0.15 | 16.2 |
| OWLv2 | - | 12.2 | 9.8 | 8.9 | 24.4 | 24.4 | 25.9 | 15.4 | 17.3 | 0.46 | 36.8 |
| LLMDet-L | - | 4.5 | 5.3 | 2.4 | 5.5 | 4.4 | 22.2 | 1.2 | 6.5 | 0.21 | 27.3 |
| APE-D | - | 12.6 | 2.2 | 7.2 | 22.7 | 31.8 | 26.7 | 11.6 | 16.4 | 0.40 | 36.9 |
| DINO-X | - | 17.2 | 19.7 | 12.9 | 30.1 | 28.4 | 31.0 | 9.7 | 21.3 | 0.38 | 55.2 |
| Gemini 2.5 | - | 9.9 | 13.1 | 8.2 | 19.6 | 15.1 | 18.8 | 6.5 | 13.0 | 0.29 | 46.1 |
| SAM3 | 77 | 47.3 | 53.7 | 61.1 | 53.4 | 65.5 | 54.9 | 42.5 | 54.1 | 0.82 | 66.1 |
| SAM3-LiteText-S0 | 16 | 47.06 | 53.42 | 60.58 | 52.18 | 65.05 | 54.86 | 42.12 | 53.61 | 0.81 | 65.54 |
| SAM3-LiteText-S1 | 16 | 47.18 | 53.58 | 60.76 | 52.43 | 65.28 | 55.02 | 42.35 | 53.80 | 0.81 | 65.72 |
| SAM3-LiteText-L | 16 | 47.24 | 53.66 | 60.88 | 52.65 | 65.49 | 55.19 | 42.54 | 53.95 | 0.81 | 65.87 |

Note: This table shows performance of the released ctx-16 models, which were trained with a more extensive dataset mixture compared to the models reported in the paper. As a result, performance may differ slightly from the values in the associated publication.


CoreML / ONNX Export

Coming soon: export pipelines to ONNX and CoreML for cross-platform deployment.


Web Demo

Coming soon: an interactive web demo for real-time concept segmentation and tracking.


Development To-Do List

  • Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
  • Release Stage 1 Text Encoder Weights: Distill SAM3 text encoder weights to MobileCLIP-S1 combined with all 9 image encoder variants
  • Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
  • Release SAM3-LiteText Weights: Distill a lightweight MobileCLIP text encoder that is competitive with the SAM3 text encoder for efficient vision-language segmentation
  • Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
  • Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
  • ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
  • Web Demo: Interactive web demonstration for real-time concept segmentation and tracking

Call for Pull Requests

The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Due to company policy, I cannot share that codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. In particular, we are looking for:

  • Efficient MedSAM3 integration (see MedSAM2 by Bo Wang Lab)
  • A Gradio demo (e.g. EfficientTAM on Hugging Face Spaces)
  • A web demo deployed with Vercel (e.g. Segment Anything Web UI)
  • Annotation tools, such as X-AnyLabeling and AnyLabeling
  • An iOS or Android app (e.g. Cutcha Photo on the App Store)
  • An NVCC-based desktop application
  • Anything else that you think is cool!

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.

Citation

If you use EfficientSAM3 in your research, please cite:

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
      author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833}, 
}
@misc{zeng2026sam3litetextanatomicalstudysam3,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation}, 
      author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173}, 
}

License

This repository is licensed under the Apache 2.0 License.

This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.

Acknowledgments

We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.


Users

Organizations and projects using EfficientSAM3:

European Space Agency

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
