Chengxi Simon Zeng1,†, Yuxuan Jiang1, Gao Ge1, Shuai Wang2, Duolikun Danier3, Bin Zhu4, Stevan Rudinac2, David Bull1, Fan Aaron Zhang1
1Visual Information Lab, University of Bristol; 2MultiX lab, University of Amsterdam; 3University of Edinburgh; 4Singapore Management University
†Tech Lead & Corresponding Author
- Efficient Vision Encoders: Distilled into RepViT, TinyViT, and EfficientViT families (22-28M params vs SAM3's 463M)
- Efficient Text Encoders: Distilled into MobileCLIP variants (42-124M vs SAM3's 354M)
- Full PCS Models: Image + text encoders distilled for promptable concept segmentation
- LiteText Models: Keep SAM3 vision encoder, replace text encoder only
EfficientSAM3 compresses both SAM3's vision encoder and text encoder into lightweight student models while maintaining competitive performance on downstream benchmarks.
| Model | Vision | Text | Decoder | Other | Params | vs ImageSAM3 | Download |
|---|---|---|---|---|---|---|---|
| EV-M | 22.2M | 42.5M | 21.0M | 3.5M | 89.2M | 90% smaller | HF |
| RV-M | 25.6M | 42.5M | 21.0M | 3.5M | 92.7M | 89% smaller | HF |
| TV-M | 28.3M | 42.5M | 21.0M | 3.5M | 95.3M | 89% smaller | HF |
Note: "Text" is the distilled text encoder. "Transformer" is the mask decoder. "Other" includes segmentation head + scoring. ImageSAM3 (for comparison): Vision: 463M + Text: 354M + Transformer: 30.3M + Other: 14.2M = 861.5M
SAM3-LiteText keeps the SAM3 vision encoder but replaces the text encoder with lightweight MobileCLIP variants.
| Model | Vision | Text | Decoder | Other | Params | vs ImageSAM3 | Download |
|---|---|---|---|---|---|---|---|
| LiteText-S0-16 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S0-32 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S1-16 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-S1-32 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-L-16 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |
| LiteText-L-32 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |
Note: "Text" is the distilled text encoder (42.5M-123.8M). SAM3-LiteText keeps SAM3's ViT-H vision encoder (~463M) but replaces the text encoder. "Other" includes geometry encoder + segmentation head + scoring.
git clone https://github.com/SimonZeng7108/efficientsam3
cd efficientsam3
pip install -e ".[stage1]"Prerequisites:
- Python 3.10+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU support)
EfficientSAM3 replaces both the SAM3 vision encoder and text encoder with lightweight student models (EfficientViT/RepViT/TinyViT + MobileCLIP).
from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image
# Load EfficientSAM3 TV-M model (uses TinyViT vision encoder + MobileCLIP-S0 text encoder)
model = build_efficientsam3_image_model(
checkpoint_path="efficientsam3_tinyvit.pt",
backbone_type="tinyvit",
model_name="11m",
text_encoder_type="MobileCLIP-S0",
text_encoder_context_length=16,
load_from_HF=False,
)
# Process image
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)
# Text prompt segmentation
state = processor.set_text_prompt("dog", state)
# Get masks
masks = state["masks"]
scores = state["scores"]
print(f"Found {len(masks)} masks")SAM3-LiteText keeps the SAM3 vision encoder but replaces the heavy text encoder with a lightweight MobileCLIP variant.
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image
# Build model with LiteText encoder (keeps SAM3 ViT, replaces text encoder)
model = build_sam3_image_model(
checkpoint_path="sam3_litetext_mobileclip_s0_ctx16.pt",
text_encoder_type="MobileCLIP-S0",
text_encoder_context_length=16,
load_from_HF=False,
)
# Use as normal
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)
state = processor.set_text_prompt("person", state)
masks = state["masks"]Training:
- Stage 1: Encoder distillation training details in README_stage1.md
- Stage 3: Full fine-tuning details in README_stage3.md
Evaluation:
-
To evaluate models on COCO dataset:
python eval/eval_coco.py --coco_root data/coco --output_dir output
-
To evaluate text encoder quality (token-level cosine similarity vs SAM3 teacher):
python eval/eval_text_encoder_similarity.py \ --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \ --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \ --device cuda # Optional: override teacher checkpoint python eval/eval_text_encoder_similarity.py \ --teacher-ckpt /path/to/teacher.pth \ --student-ckpt /path/to/student.pth \ --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \ --device cuda
For dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see:
- Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
- Release Stage 1 Text Encoder Weights: Distill SAM3 text encoder weights to MobileCLIP-S1 combined with all 9 image encoder variants
- Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
- Release SAM3-LiteText Weights: Distilled a lightweight MobileCLIP text encoder that is competitive to the SAM3 text encoder for efficient vision-language segmentation
- Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
- Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
- ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
- Web Demo: Interactive web demonstration for real-time concept segmentation and tracking
The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Since company policy, I cannot share the codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.
We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. Particularly, we are looking for:
- Efficient MedSAM3 integration (see MedSAM2 by Bo Wang Lab)
- A Gradio demo (e.g. EfficientTAM on Hugging Face Spaces)
- A web demo deployed with Vercel (e.g. Segment Anything Web UI)
- Annotation tools, such as X-AnyLabeling and AnyLabeling
- An iOS or Android app (e.g. Cutcha Photo on the App Store)
- An NVCC-based desktop application
- Anything else that you think is cool!
All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.
If you find EfficientSAM3 useful in your research, please cite:
@misc{zeng2025efficientsam3,
title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
year={2025},
eprint={2511.15833},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.15833},
}
@misc{zeng2026sam3litetext,
title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
year={2026},
eprint={2602.12173},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.12173},
}This repository is licensed under the Apache 2.0 License.
This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.
We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.
Organizations and projects using EfficientSAM3:
Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.


