EfficientSAM3: Progressive Hierarchical Knowledge Distillation (PhD) from SAM1, 2 and 3

Chengxi Simon Zeng1,∗, Yuxuan Jiang1, Ge Gao1, Shuai Wang2, Duolikun Danier3, Bin Zhu4, Stevan Rudinac2, David Bull1, Fan Aaron Zhang1,†

1Visual Information Lab, University of Bristol; 2University of Amsterdam; 3School of Informatics, University of Edinburgh; 4Singapore Management University

∗Primary Contributor †Corresponding Author

📄 Paper

Updates

  • [2026/02/18] SAM3-LiteText released! SAM3-LiteText reduces text encoder parameters by up to 88% with similar performance to the original text encoder. Paper available on arXiv.
  • [2026/01/11] Stage 1 geometry-prompt fine-tuned (ft) weights released/updated (image encoders on 1% SA-1B; text encoders fine-tuned on SA-Co Gold+Silver).
  • [2025/12/08] Stage 1 text encoder weights released for all 3 variants (MobileCLIP S0, S1, and MobileCLIP2 L), distilled on 1% of the Recap-DataComp-1B dataset.
  • [2025/12/02] Stage 1 image encoder weights released for all 9 variants (RepViT, TinyViT, EfficientViT), distilled (unsupervised) on 1% of the SA-1B dataset.
  • [2025/11/25] Teaser model released; see the Inference section below. More models are baking in the oven 🔥.
  • [2025/10/18] Project announced. Code and weights are not released yet; they will be published once SAM3 code is publicly available.

SAM3 (Segment Anything Model 3) has introduced powerful Promptable Concept Segmentation (PCS) capabilities, enabling semantic understanding and temporal object tracking beyond traditional mask generation. However, SAM3's massive vision backbone and dense memory bank make it impractical for real-time, on-device applications where computational resources and latency constraints are critical.

EfficientSAM3 addresses this challenge by distilling SAM3's capabilities into lightweight architectures suitable for edge devices, enabling high-quality concept segmentation on mobile phones, embedded systems, and resource-constrained platforms.

EfficientSAM3 Architecture


Supported Models and Architecture

| Component | Model/Backbone | Purpose |
|---|---|---|
| Teacher Models | SAM (Segment Anything Model) | Foundation for image-level encoder distillation |
| | SAM2 | Temporal memory and video tracking distillation |
| | SAM3 | Promptable Concept Segmentation (PCS) capabilities |
| Datasets | SA-1B | Image segmentation dataset |
| | SA-V | Video object segmentation dataset |
| | SA-Co/Gold | Promptable concept segmentation benchmark |
| | Recap-DataComp-1B | Large-scale image-text dataset for text encoder distillation |
| Student Backbones (Image) | RepViT (M0.9, M1.1, M2.3) | Mobile-optimized Vision Transformer for highest throughput |
| | TinyViT (5M, 11M, 21M) | Balanced efficiency and performance |
| | EfficientViT (B0, B1, B2) | Ultra-lightweight architectures for minimal latency |
| Student Backbones (Text) | MobileCLIP S0 | Lightweight text encoder (42.57M params) |
| | MobileCLIP S1 | Balanced text encoder (63.56M params) |
| | MobileCLIP2 L | Larger text encoder (123.6M params) |

Three-Stage Progressive Training Curriculum

EfficientSAM3 is trained with a three-stage progressive distillation curriculum:

Stage 1: Encoder Distillation (Image-Level Segmentation)

  • Distill the SAM3 image encoder into nine student backbones (three RepViT, three TinyViT, and three EfficientViT variants)
  • Distill the SAM3 text encoder into three student text encoders (MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L)
  • Use SA-1B dataset with Prompt-in-the-Loop Distillation for image encoder distillation
  • Use Recap-DataComp-1B dataset for text encoder distillation
  • Align student backbone features with teacher encoder outputs.
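
A minimal sketch of the Stage 1 feature-alignment objective is shown below, assuming the teacher and student encoders both emit spatial feature maps; the projection layer, channel dimensions, and plain MSE loss are illustrative placeholders rather than the exact training recipe (which also uses Prompt-in-the-Loop Distillation):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical channel sizes: the student backbone emits (B, 256, H, W),
# the teacher SAM3 encoder emits (B, 1024, H', W').
proj = nn.Conv2d(256, 1024, kernel_size=1)  # align student channels to the teacher's

def stage1_distill_loss(student_feats, teacher_feats):
    # Project, match spatial resolution, then align features with an MSE objective.
    s = proj(student_feats)
    s = F.interpolate(s, size=teacher_feats.shape[-2:], mode="bilinear", align_corners=False)
    return F.mse_loss(s, teacher_feats.detach())

# Random tensors stand in for real encoder outputs:
loss = stage1_distill_loss(torch.randn(2, 256, 64, 64), torch.randn(2, 1024, 64, 64))
loss.backward()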

Stage 2: Temporal Memory Distillation (Video Tracking)

  • Replace SAM3's dense memory bank with a compact Perceiver-based memory module (adapted from EdgeTAM)
  • Distill memory-conditioned mask predictions using SA-V dataset
  • Train the Perceiver module to compress and retrieve spatiotemporal features efficiently
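
The sketch below illustrates the Perceiver-style compression idea: a small set of learned latent queries cross-attends to the dense per-frame memory features, so only the compact latents need to be stored in the memory bank. The module name, dimensions, and number of latents are illustrative assumptions, not the released implementation:

import torch
import torch.nn as nn

class PerceiverMemory(nn.Module):
    """Compress dense spatial memory features into a fixed number of latent tokens."""
    def __init__(self, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, memory_tokens):
        # memory_tokens: (B, N, dim) flattened spatial features from past frames
        q = self.latents.unsqueeze(0).expand(memory_tokens.shape[0], -1, -1)
        compressed, _ = self.cross_attn(q, memory_tokens, memory_tokens)
        return self.norm(compressed)  # (B, num_latents, dim)

# Compress 64x64 spatial tokens from one frame down to 64 memory tokens.
mem = PerceiverMemory()
compact_memory = mem(torch.randn(1, 64 * 64, 256))  # -> (1, 64, 256)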

Stage 3: End-to-End Fine-Tuning (Concept Segmentation)

  • Refine the complete EfficientSAM3 pipeline using the official SAM3 dataset
  • Joint optimization of distilled encoder + compressed memory + mask decoder
  • Preserve Promptable Concept Segmentation capabilities while maintaining efficiency

tl;dr

Stage 1: We distill the SAM3 encoder using SAM1 data.
Stage 2: We align the distilled encoder to a perceiver and an efficient memory bank using SAM2 data.
Stage 3: We fine-tune the complete pipeline using SAM3 data.


Installation

EfficientSAM3 purposely shares the same software contract as upstream SAM3:

  • Python ≥ 3.12
  • PyTorch 2.7.0 (CUDA 12.6 build recommended)
  • CUDA-capable GPUs with drivers that support CUDA ≥ 12.6

Follow the exact environment setup from the official SAM3 README or use the condensed steps below:

git clone https://github.com/SimonZeng7108/efficientsam3.git
cd efficientsam3

conda create -n efficientsam3 python=3.12 -y
conda activate efficientsam3

pip install --upgrade pip
pip install torch==2.7.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126

# Install repo dependencies via the root pyproject (brings in SAM3 + Stage-1 extras)
pip install -e ".[stage1]"

# Note: the Stage-1 extra includes the SAM1 package dependency
# (PyPI name: segment-anything, import name: segment_anything).
# If your environment cannot resolve it from PyPI, install the vendored repo instead:
# pip install -e ./segment-anything

Inference

Download checkpoints from the Model Zoo section below. All Stage 1 image encoder weights are available via the Google Drive and Hugging Face links in those tables.
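
If you prefer to fetch checkpoints programmatically, the snippet below uses huggingface_hub; the repo ID is a placeholder and the filename is only an example, so substitute the entries linked in the Model Zoo tables:

from huggingface_hub import hf_hub_download

# Placeholder repo_id/filename: replace with the entries from the Model Zoo tables.
checkpoint_path = hf_hub_download(
    repo_id="<user>/<efficientsam3-weights-repo>",
    filename="efficient_sam3_efficientvit_s.pt",
)
print(checkpoint_path)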

Quick Start (Image Segmentation):

🔥 Teaser Image Model

EfficientViT-S (0.68M params) distilled from the SAM3 Encoder (461.84M), a 99.85% size reduction; trained on 1% of SA-1B.

from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Load model
model = build_efficientsam3_image_model(
  checkpoint_path="efficient_sam3_efficientvit_s.pt",
  backbone_type="efficientvit",
  model_name="b0",
  enable_inst_interactivity=True,
)

# Process image and predict
processor = Sam3Processor(model)
inference_state = processor.set_image(image)

# Single positive point prompt (x, y) in pixels
points = [[image.size[0] / 2, image.size[1] / 2]]
labels = [1]
masks, scores, _ = model.predict_inst(
    inference_state, 
    point_coords=points, 
    point_labels=labels
)
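
To sanity-check the prediction, the snippet below overlays the highest-scoring mask on the input image; it assumes masks and scores are array-like (torch tensors or NumPy arrays), hence the defensive conversion:

import numpy as np
import matplotlib.pyplot as plt

def to_numpy(x):
    # Handles torch tensors (CPU or GPU) as well as plain arrays/lists.
    return np.asarray(x.detach().cpu() if hasattr(x, "detach") else x)

best_idx = int(np.argmax(to_numpy(scores)))
best_mask = to_numpy(masks[best_idx]).squeeze()

plt.imshow(image)
plt.imshow(best_mask, alpha=0.5, cmap="jet")
plt.axis("off")
plt.show()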

🔥 Teaser Text Prompt Model

MobileCLIP-S1 (63.56M) distilled from the SAM3 Text Encoder (353.72M); trained on 1% of Recap-DataComp-1B.

from PIL import Image

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Load model with text encoder
model = build_efficientsam3_image_model(
    checkpoint_path="efficient_sam3_tinyvit_m_mobileclip_s1.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S1"
)

# Process image and predict with text prompt
processor = Sam3Processor(model)
inference_state = processor.set_image(image)
inference_state = processor.set_text_prompt(prompt="shoe", state=inference_state)
masks = inference_state["masks"]
scores = inference_state["scores"]
print(len(scores), scores)

🔥 SAM3-LiteText Model

Build a SAM3-LiteText model with a single call; the builder handles text encoder creation, checkpoint loading, and context length truncation internally.

from PIL import Image

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load an input image (path is illustrative)
image = Image.open("example.jpg").convert("RGB")

# Build SAM3-LiteText model
# Supported text_encoder_type: "MobileCLIP-S0", "MobileCLIP-S1", "MobileCLIP2-L"
# Supported text_encoder_context_length: 16, 32, or 77
model = build_sam3_image_model(
    checkpoint_path="efficient_sam3_image_encoder_mobileclip_s1_ctx32.pt",
    load_from_HF=False,
    text_encoder_type="MobileCLIP-S1",
    text_encoder_context_length=16,
    device='cuda',
)

# Run inference
processor = Sam3Processor(model, device='cuda', confidence_threshold=0.4)
state = processor.set_image(image)
state = processor.set_text_prompt("shoe", state)
masks = state["masks"]
scores = state["scores"]

For detailed examples including point/box prompts, batched inference, and more, see sam3/efficientsam3_examples/efficientsam3_for_sam1_task_example.py. For text prompt inference, see sam3/efficientsam3_examples/efficientsam3_image_predictor_example.ipynb. For SAM3-LiteText inference examples, see sam3/efficientsam3_examples/efficientsam3_litetext_image_inference_example.py (image) and sam3/efficientsam3_examples/efficientsam3_litetext_video_predictor_example.ipynb (video).


Training and Evaluation

Training:

  • For Stage 1 encoder distillation training details, see README_stage1.md. For Stage 1 geometry fine-tuning, check the stage1_geometry_finetune branch.
  • Stage 2 and Stage 3 training details coming soon.

Evaluation:

  • To evaluate models on the COCO dataset:

    python eval/eval_coco.py --coco_root data/coco --output_dir output
    
  • To evaluate text encoder quality (token-level cosine similarity vs SAM3 teacher):

    python eval/eval_text_encoder_similarity.py \
      --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
      --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
      --device cuda
    # Optional: override teacher checkpoint
    #   --teacher-ckpt /path/to/sam3_teacher_checkpoint.pt
    

Datasets

Dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS are provided in the data/ directory.


EfficientSAM3 Model Zoo & Weight Release

SAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | HF | Planned | Planned |
| ES-RV-M | RepViT-M1.1 | 7.77M | HF (ft: HF) | Planned | Planned |
| ES-RV-L | RepViT-M2.3 | 22.40M | HF | Planned | Planned |
| ES-TV-S | TinyViT-5M | 5.07M | HF | Planned | Planned |
| ES-TV-M | TinyViT-11M | 10.55M | HF (ft: HF) | Planned | Planned |
| ES-TV-L | TinyViT-21M | 20.62M | HF | Planned | Planned |
| ES-EV-S | EfficientViT-B0 | 0.68M | HF | Planned | Planned |
| ES-EV-M | EfficientViT-B1 | 4.64M | HF (ft: HF) | Planned | Planned |
| ES-EV-L | EfficientViT-B2 | 14.98M | HF | Planned | Planned |

Note (2025/12/02): The current Stage 1 image encoder weights are distilled on 1% of the SA-1B dataset.

Note (2026/01/11): The fine-tuned (ft) models use geometry-prompt fine-tuning on the same 1% subset of SA-1B; see training details in the stage1_geometry_finetune branch.

EfficientSAM3 Text Encoder + EfficientSAM3 Image Encoder Models

| Model Name | Backbone | Parameters | Stage 1 Weights (Encoder Distilled) | Stage 2 Weights (Memory Module Trained) | Stage 3 Weights (End-to-End Fine-Tuned) |
|---|---|---|---|---|---|
| ES-RV-S-MC-S1 | RepViT-M0.9 & MobileCLIP-S1 | 4.72M + 63.56M | HF | Planned | Planned |
| ES-RV-M-MC-S1 | RepViT-M1.1 & MobileCLIP-S1 | 7.77M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-RV-L-MC-S1 | RepViT-M2.3 & MobileCLIP-S1 | 22.40M + 63.56M | HF | Planned | Planned |
| ES-TV-S-MC-S1 | TinyViT-5M & MobileCLIP-S1 | 5.07M + 63.56M | HF | Planned | Planned |
| ES-TV-M-MC-S1 | TinyViT-11M & MobileCLIP-S1 | 10.55M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-TV-L-MC-S1 | TinyViT-21M & MobileCLIP-S1 | 20.62M + 63.56M | HF | Planned | Planned |
| ES-EV-S-MC-S1 | EfficientViT-B0 & MobileCLIP-S1 | 0.68M + 63.56M | HF | Planned | Planned |
| ES-EV-M-MC-S1 | EfficientViT-B1 & MobileCLIP-S1 | 4.64M + 63.56M | HF (ft: HF) | Planned | Planned |
| ES-EV-L-MC-S1 | EfficientViT-B2 & MobileCLIP-S1 | 14.98M + 63.56M | HF | Planned | Planned |

Note (2025/12/08): The current Stage 1 text encoder weights are distilled on 1% of the Recap-DataComp-1B dataset and released in combination with all 9 image encoder variants. We observe some performance degradation; this is expected because the text encoders are not yet aligned with the lightweight image encoders in Stage 1. We will release the Stage 1+ fine-tuned weights in the future.

Note (2025/12/08): We have also uploaded standalone text encoder weights trained on 1% of the Recap-DataComp-1B dataset: MobileCLIP-S1 and MobileCLIP2-L. You can merge these with the Stage 1 image encoder weights to obtain a full model.

Note (2026/01/11): The fine-tuned (ft) text encoder models are fine-tuned on SA-Co Gold+Silver text annotations. Standalone fine-tuned text encoder weights: MobileCLIP-S0, MobileCLIP-S1, and MobileCLIP2-L.

SAM3-LiteText Models

SAM3-LiteText replaces the SAM3 text encoder with a lightweight distilled text encoder, reducing text encoder parameters by up to 88% with comparable performance. See the SAM3-LiteText paper for details.

| Model | Text Encoder | Ctx | Text Params | Weights |
|---|---|---|---|---|
| SAM3-LiteText-S0-16 | MobileCLIP-S0 | 16 | 42.54M | HF |
| SAM3-LiteText-S1-16 | MobileCLIP-S1 | 16 | 63.53M | HF |
| SAM3-LiteText-L-16 | MobileCLIP2-L | 16 | 123.80M | HF |

All SAM3-LiteText models keep the original SAM3 image encoder; only the text encoder is replaced. The text encoder parameters shown are those of the distilled student standing in for the original 353.72M-parameter SAM3 text encoder, a parameter reduction of up to 88%.
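
As a quick check of the headline number with the MobileCLIP-S0 student: 1 - 42.54M / 353.72M ≈ 0.88, i.e. roughly an 88% reduction in text encoder parameters.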


Preliminary Evaluation

Stage 1 Image Model Evaluation Results (COCO val2017)

| Model Name | Backbone | Parameters | COCO mIoU | Test Time (s) |
|---|---|---|---|---|
| ES-RV-S | RepViT-M0.9 | 4.72M | 64.80% | 407.23 |
| ES-RV-M | RepViT-M1.1 | 7.77M | 65.28% (ft 65.60%) | 413.38 |
| ES-RV-L | RepViT-M2.3 | 22.40M | 65.53% | 466.66 |
| ES-TV-S | TinyViT-5M | 5.07M | 65.51% | 430.52 |
| ES-TV-M | TinyViT-11M | 10.55M | 65.45% (ft 65.69%) | 443.45 |
| ES-TV-L | TinyViT-21M | 20.62M | 66.29% | 452.14 |
| ES-EV-S | EfficientViT-B0 | 0.68M | 61.62% | 419.57 |
| ES-EV-M | EfficientViT-B1 | 4.64M | 64.82% (ft 64.94%) | 434.45 |
| ES-EV-L | EfficientViT-B2 | 14.98M | 66.30% | 450.36 |

Note: Evaluation was run on a single NVIDIA RTX 4070 Ti GPU.

Stage 1 Text Encoder Evaluation Results (SA-Co/VEval Noun Phrases)

Metric: average token-level cosine similarity between student text features and SAM3 text encoder features.
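
A minimal sketch of this metric is given below, assuming per-token features of shape (num_tokens, dim) from the student and teacher text encoders for the same noun phrase; the real script (eval/eval_text_encoder_similarity.py) additionally handles tokenization, padding, and checkpoint loading:

import torch
import torch.nn.functional as F

def token_cosine_similarity(student_tokens, teacher_tokens):
    # student_tokens, teacher_tokens: (num_tokens, dim) features for one noun phrase.
    sims = F.cosine_similarity(student_tokens, teacher_tokens, dim=-1)  # (num_tokens,)
    return sims.mean()

# Averaged over all noun phrases in the eval set:
# avg = torch.stack([token_cosine_similarity(s, t) for s, t in phrase_pairs]).mean()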

Pretrained on 1% Recap-DataComp-1B

| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (Recap-DC1B 1% pt) | MobileCLIP-S0 | 0.864846 | 5184 noun phrases |
| ES-MC-S1 (Recap-DC1B 1% pt) | MobileCLIP-S1 | 0.854405 | 5184 noun phrases |
| ES-MC2-L (Recap-DC1B 1% pt) | MobileCLIP2-L | 0.850976 | 5184 noun phrases |

Fine-tuned on SA-Co Gold+Silver text annotations

| Model Name | Text Backbone | Avg Cos Similarity | Eval Set |
|---|---|---|---|
| ES-MC-S0 (SA-Co ft) | MobileCLIP-S0 | 0.938915 | 5184 noun phrases |
| ES-MC-S1 (SA-Co ft) | MobileCLIP-S1 | 0.947152 | 5184 noun phrases |
| ES-MC2-L (SA-Co ft) | MobileCLIP2-L | 0.952901 | 5184 noun phrases |

Note: Evaluation is done with eval_text_encoder_similarity.py using data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json. Pretrained models are trained on Recap-DataComp-1B (1%), and fine-tuned models are trained on SA-Co Gold+Silver text annotations.

SAM3-LiteText Evaluation Results (SA-Co/Gold, Metric: CG_F1)

| Model | Ctx | MetaClip | SA1B | Crowd | Food | SptEq | Attr | Wiki | Avg F1 | MCC | pmF1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gDino-T | - | 2.9 | 3.1 | 0.28 | 0.96 | 1.1 | 13.8 | 0.70 | 3.3 | 0.15 | 16.2 |
| OWLv2 | - | 12.2 | 9.8 | 8.9 | 24.4 | 24.4 | 25.9 | 15.4 | 17.3 | 0.46 | 36.8 |
| LLMDet-L | - | 4.5 | 5.3 | 2.4 | 5.5 | 4.4 | 22.2 | 1.2 | 6.5 | 0.21 | 27.3 |
| APE-D | - | 12.6 | 2.2 | 7.2 | 22.7 | 31.8 | 26.7 | 11.6 | 16.4 | 0.40 | 36.9 |
| DINO-X | - | 17.2 | 19.7 | 12.9 | 30.1 | 28.4 | 31.0 | 9.7 | 21.3 | 0.38 | 55.2 |
| Gemini 2.5 | - | 9.9 | 13.1 | 8.2 | 19.6 | 15.1 | 18.8 | 6.5 | 13.0 | 0.29 | 46.1 |
| SAM3 | 77 | 47.3 | 53.7 | 61.1 | 53.4 | 65.5 | 54.9 | 42.5 | 54.1 | 0.82 | 66.1 |
| SAM3-LiteText-S0 | 16 | 47.06 | 53.42 | 60.58 | 52.18 | 65.05 | 54.86 | 42.12 | 53.61 | 0.81 | 65.54 |
| SAM3-LiteText-S1 | 16 | 47.18 | 53.58 | 60.76 | 52.43 | 65.28 | 55.02 | 42.35 | 53.80 | 0.81 | 65.72 |
| SAM3-LiteText-L | 16 | 47.24 | 53.66 | 60.88 | 52.65 | 65.49 | 55.19 | 42.54 | 53.95 | 0.81 | 65.87 |

Note: This table shows performance of the released ctx-16 models, which were trained with a more extensive dataset mixture compared to the models reported in the paper. As a result, performance may differ slightly from the values in the associated publication.


CoreML / ONNX Export

Coming soon: export pipelines to ONNX and CoreML for cross-platform deployment.


Web Demo

Coming soon: an interactive web demo for real-time concept segmentation and tracking.


Development To-Do List

  • Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
  • Release Stage 1 Text Encoder Weights: Distill SAM3 text encoder weights to MobileCLIP-S1 combined with all 9 image encoder variants
  • Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
  • Release SAM3-LiteText Weights: Distill a lightweight MobileCLIP text encoder that is competitive with the SAM3 text encoder for efficient vision-language segmentation
  • Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
  • Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
  • ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
  • Web Demo: Interactive web demonstration for real-time concept segmentation and tracking

Call for Pull Requests

The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Due to company policy, I cannot share that codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. In particular, we are looking for:

  • Efficient MedSAM3 integration (see MedSAM2 by Bo Wang Lab)
  • A Gradio demo (e.g. EfficientTAM on Hugging Face Spaces)
  • A web demo deployed with Vercel (e.g. Segment Anything Web UI)
  • Annotation tools, such as X-AnyLabeling and AnyLabeling
  • An iOS or Android app (e.g. Cutcha Photo on the App Store)
  • An NVCC-based desktop application
  • Anything else that you think is cool!

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.

Citation

If you use EfficientSAM3 in your research, please cite:

@misc{zeng2025efficientsam3progressivehierarchicaldistillation,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3}, 
      author={Chengxi Zeng and Yuxuan Jiang and Aaron Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833}, 
}
@misc{zeng2026sam3litetextanatomicalstudysam3,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation}, 
      author={Chengxi Zeng and Yuxuan Jiang and Ge Gao and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173}, 
}

License

This repository is licensed under the Apache 2.0 License.

This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.

Acknowledgments

We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.


Users

Organizations and projects using EfficientSAM3:

European Space Agency

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.
