EfficientSAM3: Progressive Hierarchical Knowledge Distillation from SAM1, 2 and 3

Chengxi Simon Zeng1,†, Yuxuan Jiang1, Gao Ge1, Shuai Wang2, Duolikun Danier3, Bin Zhu4, Stevan Rudinac2, David Bull1, Fan Aaron Zhang1

1Visual Information Lab, University of Bristol; 2MultiX lab, University of Amsterdam; 3University of Edinburgh; 4Singapore Management University

† Tech Lead & Corresponding Author



Table of Contents

  1. Highlights
  2. Model Zoo
  3. Installation
  4. Quick Start
  5. Training and Evaluation
  6. Citations

Highlights

  • Efficient Vision Encoders: Distilled into RepViT, TinyViT, and EfficientViT families (22-28M params vs SAM3's 463M)
  • Efficient Text Encoders: Distilled into MobileCLIP variants (42-124M vs SAM3's 354M)
  • Full PCS Models: Image + text encoders distilled for promptable concept segmentation
  • LiteText Models: Keep SAM3 vision encoder, replace text encoder only

Model Zoo

EfficientSAM3 Full Models (Lightweight Image + Text Encoders)

EfficientSAM3 compresses both SAM3's vision encoder and text encoder into lightweight student models while maintaining competitive performance on downstream benchmarks.

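| Model | Vision | Text | Decoder | Other | Params | vs ImageSAM3 | Download |
|-------|--------|------|---------|-------|--------|--------------|----------|
| EV-M | 22.2M | 42.5M | 21.0M | 3.5M | 89.2M | 90% smaller | HF |
| RV-M | 25.6M | 42.5M | 21.0M | 3.5M | 92.7M | 89% smaller | HF |
| TV-M | 28.3M | 42.5M | 21.0M | 3.5M | 95.3M | 89% smaller | HF |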

Note: "Text" is the distilled text encoder. "Decoder" is the transformer mask decoder. "Other" includes the segmentation head and scoring. ImageSAM3 (for comparison): Vision 463M + Text 354M + Decoder 30.3M + Other 14.2M = 861.5M.
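
The totals and reduction percentages in the table follow directly from the component counts. Below is a minimal sketch of that arithmetic, using only numbers from the table and note above. The function and variable names are illustrative and are not part of the repository.

```python
# Component parameter counts in millions, copied from the table and note above.
IMAGE_SAM3 = {"vision": 463.0, "text": 354.0, "decoder": 30.3, "other": 14.2}
EV_M = {"vision": 22.2, "text": 42.5, "decoder": 21.0, "other": 3.5}


def total_params(components):
    """Sum the per-component parameter counts (in millions)."""
    return sum(components.values())


def percent_smaller(student, teacher):
    """Relative size reduction of the student versus the teacher, in percent."""
    return 100.0 * (1.0 - total_params(student) / total_params(teacher))


print(f"ImageSAM3: {total_params(IMAGE_SAM3):.1f}M")
print(f"EV-M: {total_params(EV_M):.1f}M, {percent_smaller(EV_M, IMAGE_SAM3):.0f}% smaller")
```

Swapping in the RV-M or TV-M row reproduces the other entries, up to rounding of the per-component values.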

SAM3-LiteText Models (Lightweight Text Encoder Only)

SAM3-LiteText keeps the SAM3 vision encoder but replaces the text encoder with lightweight MobileCLIP variants.

| Model | Vision | Text | Decoder | Other | Params | vs ImageSAM3 | Download |
|-------|--------|------|---------|-------|--------|--------------|----------|
| LiteText-S0-16 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S0-32 | 463.0M | 42.5M | 30.3M | 14.2M | 550.0M | 36% smaller | HF |
| LiteText-S1-16 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-S1-32 | 463.0M | 63.5M | 30.3M | 14.2M | 571.0M | 34% smaller | HF |
| LiteText-L-16 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |
| LiteText-L-32 | 463.0M | 123.8M | 30.3M | 14.2M | 631.3M | 27% smaller | HF |

Note: "Text" is the distilled text encoder (42.5M-123.8M). SAM3-LiteText keeps SAM3's ViT-H vision encoder (~463M) but replaces the text encoder. "Other" includes geometry encoder + segmentation head + scoring.
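
The LiteText names appear to encode the MobileCLIP variant followed by a number. Judging by the Quick Start checkpoint name `sam3_litetext_mobileclip_s0_ctx16.pt`, that number is assumed here to be the text context length. The sketch below splits a name into those two parts. `parse_litetext_name` and `LiteTextSpec` are hypothetical helpers for illustration, not repository functions.

```python
from typing import NamedTuple


class LiteTextSpec(NamedTuple):
    variant: str         # MobileCLIP variant tag, e.g. "S0", "S1", "L"
    context_length: int  # assumed meaning of the trailing number


def parse_litetext_name(name: str) -> LiteTextSpec:
    """Split a Model Zoo name such as 'LiteText-S0-16' into its parts."""
    prefix, variant, ctx = name.split("-")
    if prefix != "LiteText":
        raise ValueError(f"not a LiteText model name: {name!r}")
    return LiteTextSpec(variant=variant, context_length=int(ctx))


spec = parse_litetext_name("LiteText-S1-32")
print(spec.variant, spec.context_length)
```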


Installation

git clone https://github.com/SimonZeng7108/efficientsam3
cd efficientsam3
pip install -e ".[stage1]"

Prerequisites:

  • Python 3.10+
  • PyTorch 2.0+
  • CUDA 11.8+ (for GPU support)
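
Before installing, it can help to confirm that the environment meets these prerequisites. The following is a small sketch, not a script shipped with the repository. It reports the Python version check and, if PyTorch is already installed, the PyTorch version and CUDA build it was compiled against. The name `check_environment` is illustrative.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 10)):
    """Report whether the interpreter and PyTorch meet the listed prerequisites."""
    report = {
        "python_ok": sys.version_info >= min_python,
        "torch_version": None,
        "cuda_version": None,
        "cuda_available": False,
    }
    # Only import torch if it is installed, so the check itself never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_version"] = torch.__version__
        report["cuda_version"] = torch.version.cuda  # None for CPU-only builds
        report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

Compare the reported versions against the list above by eye; the sketch deliberately does not parse version strings.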

Quick Start

EfficientSAM3 (Full Models with Lightweight Encoders)

EfficientSAM3 replaces both the SAM3 vision encoder and text encoder with lightweight student models (EfficientViT/RepViT/TinyViT + MobileCLIP).

from sam3.model_builder import build_efficientsam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Load EfficientSAM3 TV-M model (uses TinyViT vision encoder + MobileCLIP-S0 text encoder)
model = build_efficientsam3_image_model(
    checkpoint_path="efficientsam3_tinyvit.pt",
    backbone_type="tinyvit",
    model_name="11m",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Process image
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)

# Text prompt segmentation
state = processor.set_text_prompt("dog", state)

# Get masks
masks = state["masks"]
scores = state["scores"]
print(f"Found {len(masks)} masks")
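
A common next step is to keep only confident detections. The sketch below assumes that `state["masks"]` and `state["scores"]` are parallel sequences (one score per mask) and that each score can be converted with `float()`. Neither assumption is confirmed by the snippet above, and the 0.5 threshold is an arbitrary illustrative value. `filter_by_score` is a hypothetical helper.

```python
def filter_by_score(masks, scores, threshold=0.5):
    """Keep (mask, score) pairs whose score is at least `threshold`, best first."""
    pairs = [(mask, float(score)) for mask, score in zip(masks, scores)]
    kept = [pair for pair in pairs if pair[1] >= threshold]
    # Sort by score only; masks themselves may not be comparable.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Stand-in data; with a real model you would pass state["masks"], state["scores"].
demo_masks = ["mask_a", "mask_b", "mask_c"]
demo_scores = [0.91, 0.32, 0.77]
for mask, score in filter_by_score(demo_masks, demo_scores):
    print(mask, f"{score:.2f}")
```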

SAM3-LiteText

SAM3-LiteText keeps the SAM3 vision encoder but replaces the heavy text encoder with a lightweight MobileCLIP variant.

from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor
from PIL import Image

# Build model with LiteText encoder (keeps SAM3 ViT, replaces text encoder)
model = build_sam3_image_model(
    checkpoint_path="sam3_litetext_mobileclip_s0_ctx16.pt",
    text_encoder_type="MobileCLIP-S0",
    text_encoder_context_length=16,
    load_from_HF=False,
)

# Use as normal
processor = Sam3Processor(model)
image = Image.open("your_image.jpg").convert("RGB")
state = processor.set_image(image)
state = processor.set_text_prompt("person", state)
masks = state["masks"]

Training and Evaluation

Training:

Evaluation:

  • To evaluate models on the COCO dataset:

    python eval/eval_coco.py --coco_root data/coco --output_dir output

  • To evaluate text encoder quality (token-level cosine similarity vs the SAM3 teacher):

    python eval/eval_text_encoder_similarity.py \
      --student-ckpt /path/to/student_text_encoder_1.pth /path/to/student_text_encoder_2.pth \
      --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
      --device cuda
    # Optional: override teacher checkpoint
    python eval/eval_text_encoder_similarity.py \
      --teacher-ckpt /path/to/teacher.pth \
      --student-ckpt /path/to/student.pth \
      --np-json data/sa-v-text/sa-co-veval/saco_veval_noun_phrases.json \
      --device cuda
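
For intuition about the metric named above: token-level cosine similarity compares the student and teacher embedding at each token position and averages the result. The following is a minimal pure-Python sketch of that idea, not the implementation in `eval/eval_text_encoder_similarity.py`. Averaging uniformly over all tokens is an assumption (the script may, for example, mask padding tokens), and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def token_level_similarity(student_tokens, teacher_tokens):
    """Mean cosine similarity over aligned token embeddings."""
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("need the same non-zero number of student and teacher tokens")
    sims = [cosine_similarity(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return sum(sims) / len(sims)


# Toy embeddings: two tokens, three dimensions each.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(token_level_similarity(student, teacher))
```

In the toy data the first token matches the teacher up to scale (similarity 1) and the second is orthogonal (similarity 0), so the mean is 0.5. A score closer to 1 means the student's per-token embeddings point in nearly the same directions as the teacher's.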

Datasets

For dataset setup and download scripts (data/download_*.sh) covering COCO, DAVIS, LVIS, SA-1B, SA-V, LVOS, MOSE, and YouTube-VOS, see:


To-Do List

  • Release Stage 1 Image Encoder Weights: Distilled image encoder weights from SAM3 image encoder for all 9 variants (RepViT, TinyViT, EfficientViT)
  • Release Stage 1 Text Encoder Weights: SAM3 text encoder distilled into MobileCLIP-S1, combined with all 9 image encoder variants
  • Release Stage 1+ Fine-Tuned Encoder Weights: Prompt-in-the-loop supervised fine-tuning for improved encoder performance
  • Release SAM3-LiteText Weights: A distilled lightweight MobileCLIP text encoder that is competitive with the SAM3 text encoder, for efficient vision-language segmentation
  • Release Stage 2 Memory Bank Aligned Model Weights: Models with Perceiver-based memory compression trained on SA-V dataset
  • Release Stage 3 Fine-Tuned Model Weights: End-to-end fine-tuned models on SAM3 dataset with full PCS capabilities
  • ONNX/CoreML Export: Export models to ONNX and CoreML formats for cross-platform deployment
  • Web Demo: Interactive web demonstration for real-time concept segmentation and tracking

Call for Pull Requests

The idea for this repository originated from my work on SAM2 at Amazon, particularly as part of the research described in this paper. Due to company policy, I cannot share that codebase. This year I am super excited to work on making SAM3 more efficient and accessible to the community.

We welcome contributions to EfficientSAM3! Please feel free to submit pull requests to improve the codebase, add new features, or fix bugs. Particularly, we are looking for:

All meaningful contributions will be acknowledged and integrated into both the repository and the associated paper. We warmly welcome all contributors to the repository and happily offer co-authorship to those whose work merits inclusion in the paper.


Citations

If you find EfficientSAM3 useful in your research, please cite:

@misc{zeng2025efficientsam3,
      title={EfficientSAM3: Progressive Hierarchical Distillation for Video Concept Segmentation from SAM1, 2, and 3},
      author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2025},
      eprint={2511.15833},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.15833},
}

@misc{zeng2026sam3litetext,
      title={SAM3-LiteText: An Anatomical Study of the SAM3 Text Encoder for Efficient Vision-Language Segmentation},
      author={Chengxi Zeng and Yuxuan Jiang and Gao Ge and Shuai Wang and Duolikun Danier and Bin Zhu and Stevan Rudinac and David Bull and Fan Zhang},
      year={2026},
      eprint={2602.12173},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.12173},
}

License

This repository is licensed under the Apache 2.0 License.

This project builds upon SAM, SAM2, SAM3, EdgeSAM, EdgeTAM, EfficientTAM, RepViT, TinyViT, EfficientViT, and MobileCLIP. Please refer to their respective licenses for usage terms.


Acknowledgments

We gratefully acknowledge the University of Bristol Isambard-AI supercomputer cluster for providing computational resources to this project. Special thanks to Dr. Fan Aaron Zhang for allocating resources and supporting this research.


Users

Organizations and projects using EfficientSAM3:

European Space Agency

Note: If you're using EfficientSAM3 in your work, please acknowledge us in your publications or projects. We're happy to promote your work here! Contact us to be featured in this section.

About

EfficientSAM3 compresses SAM3 into lightweight, edge-friendly models via progressive knowledge distillation for fast promptable concept segmentation and tracking.
