EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, 2 and 3

Chengxi Simon Zeng¹,†, Yuxuan Jiang¹, Gao Ge¹, Shuai Wang², Duolikun Danier³, Bin Zhu⁴, Stevan Rudinac², David Bull¹, Fan Aaron Zhang¹

¹Visual Information Lab, University of Bristol; ²MultiX lab, University of Amsterdam; ³University of Edinburgh; ⁴Singapore Management University

†Tech Lead & Corresponding Author

Highlights

Efficient Vision Encoders: Distilled into RepViT, TinyViT, and EfficientViT families (22-28M params vs SAM3's 463M)
Efficient Text Encoders: Distilled into MobileCLIP variants (42-124M vs SAM3's 354M)
Full PCS Models: Image + text encoders distilled for promptable concept segmentation
LiteText Models: Keep SAM3 vision encoder, replace text encoder only

Model Zoo

EfficientSAM3 Full Models (Lightweight Image + Text Encoders)

EfficientSAM3 compresses both SAM3's vision encoder and text encoder into lightweight student models while maintaining competitive performance on downstream benchmarks.

Model	Vision	Text	Decoder	Other	Params	vs ImageSAM3	Download
EV-M	22.2M	42.5M	21.0M	3.5M	89.2M	90% smaller	HF
RV-M	25.6M	42.5M	21.0M	3.5M	92.7M	89% smaller	HF
TV-M	28.3M	42.5M	21.0M	3.5M	95.3M	89% smaller	HF

Note: "Text" is the distilled text encoder. "Decoder" is the transformer mask decoder. "Other" includes the segmentation head and scoring. ImageSAM3, for comparison: Vision 463M + Text 354M + Decoder 30.3M + Other 14.2M = 861.5M.
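
The "Params" and "vs ImageSAM3" columns can be re-derived from the per-component counts. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and function names are illustrative, not part of the repository, and the numbers are copied from the table.

```python
# Component sizes in millions of parameters, copied from the table above.
IMAGE_SAM3 = {"vision": 463.0, "text": 354.0, "decoder": 30.3, "other": 14.2}

STUDENTS = {
    "EV-M": {"vision": 22.2, "text": 42.5, "decoder": 21.0, "other": 3.5},
    "RV-M": {"vision": 25.6, "text": 42.5, "decoder": 21.0, "other": 3.5},
    "TV-M": {"vision": 28.3, "text": 42.5, "decoder": 21.0, "other": 3.5},
}


def total_params(components):
    """Sum the per-component counts (in millions)."""
    return sum(components.values())


def percent_smaller(student, baseline):
    """Relative size reduction of `student` against `baseline`, in percent."""
    return 100.0 * (1.0 - total_params(student) / total_params(baseline))


for name, parts in STUDENTS.items():
    total = total_params(parts)
    saving = percent_smaller(parts, IMAGE_SAM3)
    print(f"{name}: {total:.1f}M total, {saving:.0f}% smaller than ImageSAM3")
```

Because the table entries are rounded to one decimal, a recomputed total can differ from the listed one in the last digit: the RV-M components sum to 92.6M against the listed 92.7M.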

SAM3-LiteText Models (Lightweight Text Encoder Only)

SAM3-LiteText keeps the SAM3 vision encoder but replaces the text encoder with lightweight MobileCLIP variants.

Model	Vision	Text	Decoder	Other	Params	vs ImageSAM3	Download
LiteText-S0-16	463.0M	42.5M	30.3M	14.2M	550.0M	36% smaller	HF
LiteText-S0-32	463.0M	42.5M	30.3M	14.2M	550.0M	36% smaller	HF
LiteText-S1-16	463.0M	63.5M	30.3M	14.2M	571.0M	34% smaller	HF
LiteText-S1-32	463.0M	63.5M	30.3M	14.2M	571.0M	34% smaller	HF
LiteText-L-16	463.0M	123.8M	30.3M	14.2M	631.3M	27% smaller	HF
LiteText-L-32	463.0M	123.8M	30.3M	14.2M	631.3M	27% smaller	HF

Note: "Text" is the distilled text encoder (42.5M-123.8M). SAM3-LiteText keeps SAM3's ViT-H vision encoder (~463M) but replaces the text encoder. "Other" includes geometry encoder + segmentation head + scoring.

Installation

git clone https://github.com/SimonZeng7108/efficientsam3
cd efficientsam3
pip install -e ".[stage1]"

Prerequisites:

Python 3.10+
PyTorch 2.0+
CUDA 11.8+ (for GPU support)
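
To confirm these prerequisites before installing, a small check along the following lines may help. It is a sketch, not a script shipped with the repository: the helper names are made up for illustration, and it only reads the installed `torch` version through the standard library's `importlib.metadata` instead of importing PyTorch.

```python
import itertools
import sys
from importlib import metadata


def parse_version(text):
    """Turn a string such as '2.1.0+cu118' into (2, 1, 0)."""
    core = text.split("+")[0]  # drop local build tags like '+cu118'
    parts = []
    for piece in core.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment():
    """Report whether Python and PyTorch satisfy the minimums listed above."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python>=3.10": meets_minimum(python_version, "3.10")}
    try:
        report["torch>=2.0"] = meets_minimum(metadata.version("torch"), "2.0")
    except metadata.PackageNotFoundError:
        report["torch>=2.0"] = False  # PyTorch is not installed
    return report


if __name__ == "__main__":
    print(check_environment())
```

Once PyTorch is installed, `torch.version.cuda` and `torch.cuda.is_available()` report the CUDA build and whether a GPU is usable.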

Quick Start

EfficientSAM3 (Full Models with Lightweight Encoders)

EfficientSAM3 replaces both the SAM3 vision encoder and text encoder with lightweight student models (EfficientViT/RepViT/TinyViT + MobileCLIP).

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Load EfficientSAM3 TV-M model (uses TinyViT vision encoder + MobileCLIP-S0 text encoder)
model = build_efficientsam3_image_model(
    checkpoint_path="efficientsam3_tinyvit.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Process image
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)

# Text prompt segmentation
state = processor.set_text_prompt("dog", state)

# Get masks
masks = state["masks"]
scores = state["scores"]
print(f"Found {len(masks)} masks")
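
In practice it is often useful to keep only the confident predictions. The helper below is a minimal sketch of that step. It assumes `state["masks"]` and `state["scores"]` are parallel sequences and that each score converts with `float()`; neither the helper nor the 0.5 threshold comes from the repository.

```python
def filter_by_score(masks, scores, threshold=0.5):
    """Return (mask, score) pairs with score >= threshold, best first.

    `masks` and `scores` are assumed to be parallel sequences, for example
    state["masks"] and state["scores"] from the example above.
    """
    kept = []
    for mask, score in zip(masks, scores):
        value = float(score)  # also accepts 0-d tensors or NumPy scalars
        if value >= threshold:
            kept.append((mask, value))
    # Sort on the score only, so the masks themselves are never compared.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Toy data standing in for real model output.
toy_masks = ["mask_a", "mask_b", "mask_c"]
toy_scores = [0.35, 0.92, 0.61]
for mask, score in filter_by_score(toy_masks, toy_scores, threshold=0.5):
    print(mask, score)
```

With real output, the call would look like `filter_by_score(state["masks"], state["scores"])`.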

SAM3-LiteText

SAM3-LiteText keeps the SAM3 vision encoder but replaces the heavy text encoder with a lightweight MobileCLIP variant.

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Build model with LiteText encoder (keeps SAM3 ViT, replaces text encoder)
model = build_sam3_image_model(
    checkpoint_path="sam3_litetext_mobileclip_s0_ctx16.pt",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Use as normal
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)
state = processor.set_text_prompt("person", state)
masks = state["masks"]
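
To query several concepts on one image, the same two calls can be wrapped in a loop. The sketch below uses only `set_image` and `set_text_prompt` as shown above. It calls `set_image` again for every prompt because whether a state can be safely reused across prompts is not stated here; that choice costs an extra image encoding per prompt. `StubProcessor` is a hypothetical stand-in so the loop can be dry-run without a checkpoint.

```python
def count_masks_per_prompt(processor, image, prompts):
    """Run each text prompt on the image and return {prompt: number of masks}."""
    counts = {}
    for prompt in prompts:
        # Re-encode per prompt so no prompt inherits state from the previous one.
        state = processor.set_image(image)
        state = processor.set_text_prompt(prompt, state)
        counts[prompt] = len(state["masks"])
    return counts


class StubProcessor:
    """Hypothetical stand-in for Sam3Processor, used only to dry-run the loop."""

    def set_image(self, image):
        return {"image": image}

    def set_text_prompt(self, prompt, state):
        # Pretend every word in the prompt yields one mask.
        return {**state, "masks": [None] * len(prompt.split())}


print(count_masks_per_prompt(StubProcessor(), "fake-image", ["person", "red car"]))
```

With a real model, pass the `processor` and `image` objects from the example above instead of the stub.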

Training and Evaluation

Training:

Stage 1: Encoder distillation training details in README_stage1.md
Stage 3: Full fine-tuning details in README_stage3.md

Evaluation:

To evaluate models on the COCO dataset:

python eval/eval_coco.py --coco_root data/coco --output_dir output

To evaluate text encoder quality (token-level cosine similarity vs SAM3 teacher):

python eval/eval_text_encoder_similarity.py \
  --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
  --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
  --device cuda
# Optional: override teacher checkpoint
python eval/eval_text_encoder_similarity.py \
  --teacher-ckpt /path/to/teacher.pth \
  --student-ckpt /path/to/student.pth \
  --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
  --device cuda
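
For intuition, token-level cosine similarity compares the student and teacher embeddings of each token and averages the result. The sketch below illustrates the metric on plain Python lists. It is not the code in `eval/eval_text_encoder_similarity.py`, whose details (padding, masking, aggregation) may differ, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def token_level_similarity(student_tokens, teacher_tokens):
    """Mean cosine similarity over aligned token embeddings.

    Both arguments are sequences of per-token embedding vectors for the same
    prompt, one from the student encoder and one from the teacher.
    """
    if len(student_tokens) != len(teacher_tokens) or not student_tokens:
        raise ValueError("expected two non-empty sequences of equal length")
    sims = [cosine(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return sum(sims) / len(sims)


# Toy example: the first token matches in direction, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
print(token_level_similarity(student, teacher))
```

A value near 1 means the student reproduces the direction of the teacher's token embeddings; cosine similarity ignores their magnitude.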

Datasets

For dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see:

README_dataset.md

To-Do List

Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
Release Stage 1 Text Encoder Weights: MobileCLIP-S1 text encoder weights distilled from the SAM3 text encoder, combined with all 9 image encoder variants
Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
Release SAM3-LiteText Weights: A distilled lightweight MobileCLIP text encoder that is competitive with the SAM3 text encoder for efficient vision-language segmentation
Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
Web Demo: Interactive web demonstration for real-time concept segmentation and tracking

Call for Pull Requests

The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Due to company policy, I cannot share that codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. In particular, we are looking for:

Efficient MedSAM3 integration (see MedSAM2 by Bo Wang Lab)
A Gradio demo (e.g. EfficientTAM on Hugging Face Spaces)
A web demo deployed with Vercel (e.g. Segment Anything Web UI)
Annotation tools, such as X-AnyLabeling and AnyLabeling
An iOS or Android app (e.g. Cutcha Photo on the App Store)
An NVCC-based desktop application
Anything else that you think is cool!

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.

Citations

If you find EfficientSAM3 useful in your research, please cite:

@misc{zeng2025efficientsam3,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
      author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833},
}

@misc{zeng2026sam3litetext,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
      author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173},
}

License

This repository is licensed under the Apache 2.0 License.

This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.

Acknowledgments

We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.

Users

Organizations and projects using EfficientSAM3:

European Space Agency

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

