EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3
Chengxi Simon Zeng1,†, Yuxuan Jiang1, Ge Gao1, Shuai Wang2, Duolikun Danier3, Bin Zhu4, Stevan Rudinac2, David Bull1, Fan Aaron Zhang1,✉
1Visual Information Lab, University of Bristol; 2University of Amsterdam; 3School of Informatics, University of Edinburgh; 4Singapore Management University
†Primary Contributor · ✉Corresponding Author
📄 Paper
Updates
- [2026/02/18] SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. Paper available on arXiv.
- [2026/01/11] Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
- [2025/12/08] Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L) - distilled on 1% Recap-DataComp-1B dataset.
- [2025/12/02] Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT) - unsupervised distilled on 1% of SA-1B dataset.
- [2025/11/25] Teaser model released. See above. More models are baking in the oven 🔥.
- [2025/10/18] Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.
Table of Contents
- Table of Contents
- Updates
- Installation
- Inference
- Training and Evaluation
- Datasets
- EfficientSAM3 Model Zoo & Weight Release
- Preliminary Evaluation
- CoreML / ONNX Export
- Web Demo
- Development To-Do List
- Call for Pull Requests
- Citation
- License
- Acknowledgments
- Users
SAM3 (Segment Anything Model 3) has introduced powerful Promptable Concept Segmentation (PCS) capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.
EfficientSAM3 addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.
Supported Models and Architecture
| Component | Model/Backbone | Purpose |
|---|---|---|
| Teacher Models | SAM (Segment Anything Model) | Foundation for image-level encoder distillation |
| | SAM2 | Temporal memory and video tracking distillation |
| | SAM3 | Promptable Concept Segmentation (PCS) capabilities |
| Datasets | SA-1B | Image segmentation dataset |
| | SA-V | Video object segmentation dataset |
| | SA-Co/Gold | Promptable concept segmentation benchmark |
| | Recap-DataComp-1B | Large-scale image-text dataset for text encoder distillation |
| Student Backbones (Image) | RepViT (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformers for highest throughput |
| | TinyViT (5M, 11M, 21M) | Balanced efficiency and performance |
| | EfficientViT (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| Student Backbones (Text) | MobileCLIP S0 | Lightweight text encoder (42.57M params) |
| | MobileCLIP S1 | Balanced text encoder (63.56M params) |
| | MobileCLIP2 L | Larger text encoder (123.6M params) |
Three-Stage Progressive Training Curriculum
EfficientSAM3 is trained through a three-stage progressive distillation:
Stage 1: Encoder Distillation (Image-Level Segmentation)
- Distill the SAM3 image encoder into nine student backbones (three RepViT, three TinyViT, and three EfficientViT variants)
- Distill the SAM3 text encoder into three student text encoders (MobileCLIP S0, MobileCLIP S1, and MobileCLIP2-L)
- Use SA-1B dataset with Prompt-in-the-Loop Distillation for image encoder distillation
- Use Recap-DataComp-1B dataset for text encoder distillation
- Align student backbone features with teacher encoder outputs (an illustrative alignment-loss sketch follows this list)
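For intuition, the sketch below shows one way such a feature-alignment objective can be written; the 1×1 projection, spatial interpolation, and MSE loss are illustrative assumptions, not the exact Prompt-in-the-Loop recipe used in this repo.

```python
# Minimal, illustrative sketch of a Stage 1 feature-alignment loss (assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentLoss(nn.Module):
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # Project student features into the teacher's channel dimension.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        student_feat = self.proj(student_feat)
        # Match spatial resolution if the student runs at a different stride.
        if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
            student_feat = F.interpolate(
                student_feat, size=teacher_feat.shape[-2:],
                mode="bilinear", align_corners=False,
            )
        return F.mse_loss(student_feat, teacher_feat)

# Usage (shapes are illustrative; the teacher is kept frozen):
# loss_fn = FeatureAlignmentLoss(student_channels=256, teacher_channels=1024)
# loss = loss_fn(student_encoder(images), teacher_encoder(images).detach())
```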
Stage 2: Temporal Memory Distillation (Video Tracking)
- Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
- Distill memory-conditioned mask predictions using SA-V dataset
- Train the Perceiver module to compress and retrieve spatiotemporal features efficiently (an illustrative sketch follows this list)
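For intuition, here is a minimal sketch of a Perceiver-style memory compressor in which a small set of learned latents cross-attends to per-frame memory features, so the memory bank only needs to store the compressed latents. The dimensions and module layout are illustrative assumptions, not the exact EdgeTAM-derived design.

```python
# Illustrative Perceiver-style memory compression (assumed dimensions and layout).
import torch
import torch.nn as nn

class PerceiverMemoryCompressor(nn.Module):
    """Compress per-frame memory features into a fixed number of latent tokens."""
    def __init__(self, dim: int = 256, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens: torch.Tensor) -> torch.Tensor:
        # memory_tokens: (B, N, dim) flattened spatial features from one frame.
        batch = memory_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(query=latents, key=memory_tokens, value=memory_tokens)
        return self.norm(compressed + latents)  # (B, num_latents, dim)

# Example: compress a 64x64 feature map (4096 tokens) down to 64 latent tokens.
# feats = torch.randn(1, 4096, 256)
# compact = PerceiverMemoryCompressor()(feats)
```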
Stage 3: End-to-End Fine-Tuning (Concept Segmentation)
- Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
- Joint optimization of distilled encoder + compressed memory + mask decoder
- Preserve Promptable Concept Segmentation capabilities while maintaining efficiency
tl;dr
Stage 1: We distill the SAM3 encoder using SAM1 data.
Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data.
Stage 3: We fine-tune the complete pipeline using SAM3 data.
Installation
EfficientSAM3 purposely shares the same software contract as upstream SAM3:
- Python ≥ 3.12
- PyTorch 2.7.0 (CUDA 12.6 build recommended)
- CUDA-capable GPUs with drivers that support CUDA ≥ 12.6
Follow the exact environment setup from the official SAM3 README or use the condensed steps below:
git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3
conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3
pip install --upgrade pip
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"
# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything
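Optionally, a quick sanity check (not part of the official setup steps) confirms that the CUDA build of PyTorch is active before moving on:

```python
# Optional environment sanity check.
import torch

print(torch.__version__)          # expect a 2.7.0+cu126 build
print(torch.cuda.is_available())  # expect True on a CUDA >= 12.6 capable machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```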
Inference
Download checkpoints from the Model Zoo section. All Stage 1 image encoder weights are available via Google Drive and Hugging Face links in the table below.
Quick Start (Image Segmentation):
🔥 Teaser Image Model
ES-EV-S (EfficientViT-B0, 0.68M params) distilled from the SAM3 encoder (461.84M params): 99.85% smaller, trained on 1% of SA-1B.
from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Load model
model = build_efficientsam3_image_model(
checkpoint_path="efficient_sam3_efficientvit_s.pt",
backbone_type="efficientvit",
model_name="b0",
enable_inst_interactivity=True,
)
# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
inference_state,
point_coords=points,
point_labels=labels
)
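The predictor examples in this section assume `image` is already a PIL image. A minimal, illustrative preamble is shown below; the file path and the mask post-processing are assumptions for demonstration, not part of the official API.

```python
import numpy as np
from PIL import Image

# Load the input referenced as `image` in the examples above
# ("example.jpg" is a placeholder path, not shipped with the repo).
image = Image.open("example.jpg").convert("RGB")

# After predict_inst(...), assuming `masks`/`scores` come back as (N, H, W) / (N,)
# NumPy arrays, keep the highest-scoring mask and write it out for inspection:
# best = masks[int(np.argmax(scores))]
# Image.fromarray((best * 255).astype(np.uint8)).save("mask.png")
```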
🔥 Teaser Text Prompt Model
MobileCLIP-S1 (63.56M params) distilled from the SAM3 text encoder (353.72M params), trained on 1% of Recap-DataComp-1B.
from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Load model with text encoder
model = build_efficientsam3_image_model(
checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
backbone_type="tinyvit",
model_name="11m",
text_encoder_type="MobileCLIP-S1"
)
# Process image and predict with text prompt
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)
🔥 SAM3-LiteText Model
Build a SAM3-LiteText model with a single call; the builder handles text encoder creation, checkpoint loading, and context length truncation internally.
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
# Build SAM3-LiteText model
# Supported text_encoder_type: "MobileCLIP-S0", "MobileCLIP-S1", "MobileCLIP2-L"
# Supported text_encoder_context_length: 16, 32, or 77
model = build_sam3_image_model(
checkpoint_path="efficient_sam3_image_encoder_mobileclip_s1_ctx32.pt",
load_from_HF=False,
text_encoder_type="MobileCLIP-S1",
text_encoder_context_length=16,
device='cuda',
)
# Run inference
processor = Sam3Processor(model, device='cuda', confidence_threshold=0.4)
state = processor.set_image(image)
state = processor.set_text_prompt("shoe", state)
masks = state["masks"]
scores = state["scores"]
For detailed examples including point/box prompts, batched inference, and more, see sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py. For text prompt inference, see sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb. For SAM3-LiteText inference examples, see sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py (image) and sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb (video).
Training and Evaluation
Training:
- For Stage 1 encoder distillation training details, see README_stage1.md. For Stage 1 geometry fine-tuning, check the `stage1_geometry_finetune` branch.
- Stage 2 and Stage 3 training details coming soon.
Evaluation:
To evaluate models on COCO dataset:
python eval/eval_coco.py --coco_root data/coco --output_dir output

To evaluate text encoder quality (token-level cosine similarity vs. the SAM3 teacher):

python eval/eval_text_encoder_similarity.py \
    --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
    --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
    --device cuda
# Optional: override teacher checkpoint
# --teacher-ckpt /path/to/sam3_teacher_checkpoint.pt
Datasets
For dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see the scripts and notes under the data/ directory.
EfficientSAM3 Model Zoo & Weight Release
SAM3 Text Encoder + EfficientSAM3 Image Encoder Models
| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-RV-M | RepViT-M1.1 | 7.77M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-RV-L | RepViT-M2.3 | 22.40M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-S | TinyViT-5M | 5.07M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-M | TinyViT-11M | 10.55M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-L | TinyViT-21M | 20.62M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-S | EfficientViT-B0 | 0.68M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-M | EfficientViT-B1 | 4.64M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-L | EfficientViT-B2 | 14.98M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.
Note (2026/01/11): The fine-tuned (ft) models use geometry-prompt fine-tuning on the same 1% subset of SA-1B; see training details in the `stage1_geometry_finetune` branch.
EfficientSAM3 Text Encoder + EfficientSAM3 Image Encoder Models
| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | HF (ft: HF) | $$\text{Planned}$$ | $$\text{Planned}$$ |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | HF | $$\text{Planned}$$ | $$\text{Planned}$$ |
Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset and released in combination with all 9 image encoder variants. We observe some performance degradation; this is expected, as the text encoders are not yet aligned with the lightweight image encoders in Stage 1. We will release the Stage 1+ fine-tuned weights in the future.
Note (2025/12/08): We have also uploaded standalone text encoder weights trained on 1% of the Recap-DataComp-1B dataset: MobileCLIP-S1 and MobileCLIP2-L. You can merge them with the Stage 1 image encoder weights to obtain a full model.
Note (2026/01/11): The fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.
SAM3-LiteText Models
SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to 88% with comparable performance. See the SAM3-LiteText paper for details.
| Model | Text Encoder | Ctx | Text Params | Weights |
|---|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 16 | 42.54M | HF |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 16 | 63.53M | HF |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 16 | 123.80M | HF |
All models use the SAM3 ViT-H image encoder (353.72M vision params). The text encoder parameters shown represent the distilled student replacing the original 353.72M text encoder, achieving up to 88% parameter reduction.
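As a quick check on the reduction figure using the table above: swapping the 353.72M-parameter SAM3 text encoder for the 42.54M-parameter MobileCLIP-S0 student removes 1 − 42.54/353.72 ≈ 88% of the text encoder parameters (roughly 82% for MobileCLIP-S1 and 65% for MobileCLIP2-L).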
Preliminary Evaluation
Stage 1 Image Model Evaluation Results (COCO val2017)
| Model Name | Backbone | Parameters | COCO mIoU | Test Time (s) |
|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | 64.80% | 407.23 |
| ES-RV-M | RepViT-M1.1 | 7.77M | 65.28% (ft 65.60%) | 413.38 |
| ES-RV-L | RepViT-M2.3 | 22.40M | 65.53% | 466.66 |
| ES-TV-S | TinyViT-5M | 5.07M | 65.51% | 430.52 |
| ES-TV-M | TinyViT-11M | 10.55M | 65.45% (ft 65.69%) | 443.45 |
| ES-TV-L | TinyViT-21M | 20.62M | 66.29% | 452.14 |
| ES-EV-S | EfficientViT-B0 | 0.68M | 61.62% | 419.57 |
| ES-EV-M | EfficientViT-B1 | 4.64M | 64.82% (ft 64.94%) | 434.45 |
| ES-EV-L | EfficientViT-B2 | 14.98M | 66.30% | 450.36 |
Note: The evaluation is done on a single NVIDIA RTX 4070 Ti GPU.
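For reference, mIoU here is the mean intersection-over-union between predicted and ground-truth masks. Below is a minimal sketch of the per-mask IoU; the binary-mask inputs and averaging convention are assumptions, not the exact eval_coco.py implementation.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 1.0

# mIoU over a dataset is the average of mask_iou(pred_i, gt_i) across all instances.
```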
Stage 1 Text Encoder Evaluation Results (SA-Co/VEval Noun Phrases)
Metric: average token-level cosine similarity between student text features and SAM3 text encoder features.
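Concretely, the score is the cosine similarity between corresponding student and teacher token features, averaged over non-padding tokens. A minimal sketch of such a computation is shown below; the tensor shapes and padding mask are assumptions, not the exact evaluation code.

```python
import torch
import torch.nn.functional as F

def token_cosine_similarity(student_feats: torch.Tensor,
                            teacher_feats: torch.Tensor,
                            pad_mask: torch.Tensor) -> torch.Tensor:
    """Average token-level cosine similarity.

    student_feats, teacher_feats: (B, T, D) token features for the same prompts.
    pad_mask: (B, T) boolean mask, True for real (non-padding) tokens.
    """
    cos = F.cosine_similarity(student_feats, teacher_feats, dim=-1)  # (B, T)
    return (cos * pad_mask).sum() / pad_mask.sum()
```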
Pretrained on 1% Recap-DataComp-1B
| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (Recap-DC1B 1% pt) | MobileCLIP-S0 | 0.864846 | 5184 noun phrases |
| ES-MC-S1 (Recap-DC1B 1% pt) | MobileCLIP-S1 | 0.854405 | 5184 noun phrases |
| ES-MC2-L (Recap-DC1B 1% pt) | MobileCLIP2-L | 0.850976 | 5184 noun phrases |
Fine-tuned on SA-Co Gold+Silver text annotations
| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (SA-Co ft) | MobileCLIP-S0 | 0.938915 | 5184 noun phrases |
| ES-MC-S1 (SA-Co ft) | MobileCLIP-S1 | 0.947152 | 5184 noun phrases |
| ES-MC2-L (SA-Co ft) | MobileCLIP2-L | 0.952901 | 5184 noun phrases |
Note: Evaluation is done with eval_text_encoder_similarity.py using data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json. Pretrained models are trained on Recap-DataComp-1B (1%), and fine-tuned models are trained on SA-Co Gold+Silver text annotations.
SAM3-LiteText Evaluation Results (SA-Co/Gold, Metric: CG_F1)
| Model | Ctx | MetaClip | SA1B | Crowd | Food | SptEq | Attr | Wiki | Avg F1 | MCC | pmF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gDino-T | - | 2.9 | 3.1 | 0.28 | 0.96 | 1.1 | 13.8 | 0.70 | 3.3 | 0.15 | 16.2 |
| OWLv2 | - | 12.2 | 9.8 | 8.9 | 24.4 | 24.4 | 25.9 | 15.4 | 17.3 | 0.46 | 36.8 |
| LLMDet-L | - | 4.5 | 5.3 | 2.4 | 5.5 | 4.4 | 22.2 | 1.2 | 6.5 | 0.21 | 27.3 |
| APE-D | - | 12.6 | 2.2 | 7.2 | 22.7 | 31.8 | 26.7 | 11.6 | 16.4 | 0.40 | 36.9 |
| DINO-X | - | 17.2 | 19.7 | 12.9 | 30.1 | 28.4 | 31.0 | 9.7 | 21.3 | 0.38 | 55.2 |
| Gemini 2.5 | - | 9.9 | 13.1 | 8.2 | 19.6 | 15.1 | 18.8 | 6.5 | 13.0 | 0.29 | 46.1 |
| SAM3 | 77 | 47.3 | 53.7 | 61.1 | 53.4 | 65.5 | 54.9 | 42.5 | 54.1 | 0.82 | 66.1 |
| SAM3-LiteText-S0 | 16 | 47.06 | 53.42 | 60.58 | 52.18 | 65.05 | 54.86 | 42.12 | 53.61 | 0.81 | 65.54 |
| SAM3-LiteText-S1 | 16 | 47.18 | 53.58 | 60.76 | 52.43 | 65.28 | 55.02 | 42.35 | 53.80 | 0.81 | 65.72 |
| SAM3-LiteText-L | 16 | 47.24 | 53.66 | 60.88 | 52.65 | 65.49 | 55.19 | 42.54 | 53.95 | 0.81 | 65.87 |
Note: This table shows performance of the released ctx-16 models, which were trained with a more extensive dataset mixture compared to the models reported in the paper. As a result, performance may differ slightly from the values in the associated publication.
CoreML / ONNX Export
Coming soon: export pipelines to ONNX and CoreML for cross-platform deployment.
Web Demo
Coming soon: an interactive web demo for real-time concept segmentation and tracking.
Development To-Do List
- Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
- Release Stage 1 Text Encoder Weights: Distilled text encoder weights from the SAM3 text encoder into MobileCLIP students, combined with all 9 image encoder variants
- Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
- Release SAM3-LiteText Weights: Distilled lightweight MobileCLIP text encoders that are competitive with the SAM3 text encoder for efficient vision-language segmentation
- Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
- Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
- ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
- Web Demo: Interactive web demonstration for real-time concept segmentation and tracking
Call for Pull Requests
The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Due to company policy, I cannot share that codebase. This year I am excited to work on making SAM3 more efficient and accessible to the community.
We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. In particular, we are looking for:
- Efficient MedSAM3 integration (see MedSAM2 by Bo Wang Lab)
- A Gradio demo (e.g. EfficientTAM on Hugging Face Spaces)
- A web demo deployed with Vercel (e.g. Segment Anything Web UI)
- Annotation tools, such as X-AnyLabeling and AnyLabeling
- An iOS or Android app (e.g. Cutcha Photo on the App Store)
- An NVCC-based desktop application
- Anything else that you think is cool!
All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.
Citation
If you use EfficientSAM3 in your research, please cite:
@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
year={2025},
eprint={2511.15833},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.15833},
}
@misc{zeng2026sam3litetextanatomicalstudysam3,
title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
year={2026},
eprint={2602.12173},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.12173},
}
License
This repository is licensed under the Apache 2.0 License.
This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.
Acknowledgments
We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.
Users
Organizations and projects using EfficientSAM3:
- European Space Agency
Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
