Caption free adapter that maps DINOv3 image embeddings into CLIP space so you can do zero-shot text -> image or image -> text with CLIP’s text tower

DINOv3-CLIP: Adapter & Segmentation

This repo contains two parts:

  1. Adapter — a tiny MLP that maps frozen DINOv3 image embeddings into CLIP image space so they can be compared against CLIP text embeddings. Use it for zero-shot retrieval, scoring, and open-vocab labeling.

  2. Segmentation — a lightweight decoder over DINOv3 tokens plus CLIP-guided text fusion for open-vocab semantic segmentation. Works on street scenes and can be prompt-filtered (e.g., --prompt "car").



About

| Method | Supervision used for training | Intended use | Notes |
| --- | --- | --- | --- |
| DINOv3-CLIP (ours) | Pixel masks + CLIP-image teacher (distill) | Lightweight, quick-deploy, small head | Frozen DINOv3 + tiny adapter + BN head. ~30 epochs. No augmentation in this run |
| DeepLabv3+ | Pixel masks | Strong closed-set baseline | Trains backbone + decoder |
| SegFormer | Pixel masks | Efficient closed-set encoder–decoder | Transformer encoder |
| HRNet-OCR | Pixel masks | High-quality closed-set | High-res features; strong mIoU with heavy training |
| BiSeNet-V2 | Pixel masks | Real-time closed-set | Very fast |
| Mask2Former | Pixel masks (+ extra data, often) | SOTA-ish closed-set with mask decoder | Query-based mask decoding |
| MaskDINO | Pixel masks (+ extra) | SOTA-ish closed-set | DETR-style with mask queries |
| LSeg | CLIP alignment + pixel masks | Open-vocab semantic seg | Text-aligned features |
| OpenSeg | CLIP/ALIGN features + masks/pseudo-labels | Open-vocab semantic seg at scale | Trained with large text supervision; strong zero-shot |
| MaskCLIP | CLIP + pseudo-masks | Prompt-driven zero-/few-shot seg | Often freezes CLIP |
| TCL / Talking-to-CLIP/DINO | Align SSL -> CLIP (language guidance) | Language-guided dense prediction | Bridges SSL (DINO) to text space |

Install

```shell
# (optional) new venv
python -m venv venv && source venv/bin/activate

# install from source
pip install -e .

# or install dependencies directly
pip install torch torchvision transformers open-clip-torch huggingface_hub safetensors tqdm Pillow pyyaml
```

Note: Use a recent `transformers` and pass `trust_remote_code=True` when loading DINOv3. DINOv3 models on HF are gated; request access on the model card and make sure your token has the "public gated repositories" permission.

Adapter

What it is

  • Frozen DINOv3 (image in -> embedding out) + a trainable MLP adapter that maps into CLIP image space; CLIP's text tower is used as-is at inference
  • This is NOT a new CLIP model and NOT DINOv3 fine-tuning. We did not train DINOv3 or CLIP
  • Adds zero-shot text retrieval to DINOv3 without caption supervision
  • Trained on images only, with no image–text pairs, so retrieval lags fully contrastively trained CLIP/SigLIP
  • Out of the box, this works with DINOv3 checkpoints from Hugging Face (gated) via transformers with trust_remote_code=True
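At inference the whole pipeline reduces to cosine similarity between the adapted image embedding and the CLIP text embeddings, optionally sharpened with a low-temperature softmax. A framework-agnostic numpy sketch (the function name and 768-dim shapes are illustrative assumptions, not the package's API):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_scores(img_emb, text_embs, temp=0.02):
    """Cosine similarity between one adapted image embedding and N text
    embeddings, turned into probabilities with a temperature softmax."""
    img = l2_normalize(img_emb)    # (D,)
    txt = l2_normalize(text_embs)  # (N, D)
    sims = txt @ img               # (N,) cosine similarities
    logits = sims / temp           # low temp sharpens the ranking
    logits -= logits.max()         # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# toy example: 768-dim embeddings (CLIP ViT-L/14 embedding width)
rng = np.random.default_rng(0)
img = rng.normal(size=768)
texts = rng.normal(size=(3, 768))
texts[1] = img + 0.1 * rng.normal(size=768)  # make prompt 1 the near-match
probs = zero_shot_scores(img, texts)
print(probs.argmax())  # -> 1
```

The `temp=0.02` mirrors the `--softmax --temp 0.02` flags used by the CLI below.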

Quickstart

Downloading weights

Direct download (wget/curl):

```shell
# wget
wget -O adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt
# curl
curl -L -o adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt
```

Git LFS:

```shell
git lfs install
git clone https://huggingface.co/duriantaco/dinov3clip
cp dinov3clip/adapter.pt .
```

Python:

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="duriantaco/dinov3clip", filename="adapter.pt")
print("Saved to:", ckpt_path)
```

Note: Users don't need a token for the adapter itself, but they do need HF access (and HF_TOKEN) for the DINOv3 backbone weights at inference/training time

Inference

1. Inference with free-form prompts. Compare one image to a few natural-language hypotheses:

```shell
dinov3clip infer \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/image.jpg \
  --texts "street with cars" "a person" "a dog" \
  --device cuda --topk 3 --softmax --temp 0.02 \
  --out out_infer.png  # optional overlay
```

2. Open-vocab labelling:

```shell
dinov3clip label \
  --ckpt /data_1/dinov3clip/adapter.pt \
  --image /path/to/img.jpg \
  --device cuda --topk 5 --softmax --temp 0.02 \
  --out out.png
```

3. Text -> image retrieval grid:

```shell
python scripts/viz_text_to_image_grid.py \
  --ckpt checkpoints/adapter.pt \
  --images /path/to/images \
  --query "a city street with cars" \
  --topk 8 --cols 4 \
  --out assets/t2i_grid.png
```

4. Image -> top-k prompt scores:

```shell
python scripts/viz_image_topk.py \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/images \
  --texts "a city street with cars" "a dog on grass" "a person skiing" "a red bus" "night street" \
  --out assets/image_topk.png
```

Performance

Ours = DINOv3-B/16 (image) + Adapter (3.15M) + CLIP ViT-L/14 text (laion2b). We trained the adapter to match CLIP ViT-L/14 image space and at inference we only use the text tower from that same CLIP.

COCO Retrieval (5k images / 25k caps)

| Model / Pipeline | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 | Mean R@1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text (laion2b) | 41.66 | 71.04 | 81.40 | 33.66 | 61.24 | 72.37 | 37.66 |
| CLIP ViT-L/14 (openai, full image+text) | 56.94 | 79.62 | 86.74 | 35.70 | 60.43 | 70.49 | 46.32 |
| OpenCLIP ViT-H/14 (laion2b, full) | 66.18 | 86.54 | 91.94 | 48.50 | 72.78 | 81.06 | 57.34 |
| CLIP ViT-B/16 (openai, full) | 49.28 | 73.44 | 81.84 | 30.18 | 54.43 | 65.32 | 39.73 |
| SigLIP ViT-B/16-384 (webli, full) | 67.62 | 87.98 | 92.76 | 49.40 | 73.78 | 81.95 | 58.51 |
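The R@K columns come from ranking a full query-by-candidate similarity matrix. A minimal numpy sketch of the metric itself (toy scores, not the repo's evaluation script):

```python
import numpy as np

def recall_at_k(sims, gt_idx, k):
    """sims: (num_queries, num_candidates) similarity matrix.
    gt_idx[i] is the correct candidate for query i.
    Returns the fraction of queries whose ground truth lands in the top-k."""
    topk = np.argsort(-sims, axis=1)[:, :k]  # top-k candidate indices per query
    hits = (topk == np.asarray(gt_idx)[:, None]).any(axis=1)
    return hits.mean()

# toy check: 3 queries, 4 candidates
sims = np.array([[0.9, 0.1, 0.0, 0.0],
                 [0.2, 0.1, 0.8, 0.0],   # query 1's true match (idx 1) ranks 3rd
                 [0.0, 0.0, 0.1, 0.7]])
gt = [0, 1, 3]
print(recall_at_k(sims, gt, 1))  # 2 of 3 queries hit at k=1
print(recall_at_k(sims, gt, 3))  # all hit by k=3
```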

Model size (footprint)

| Pipeline | Params (M) | fp16 | fp32 |
| --- | --- | --- | --- |
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text | 212.46 | 405.2 MB | 810.5 MB |
| CLIP ViT-L/14 (full image+text) | 427.62 | 815.6 MB | 1.6 GB |
| OpenCLIP ViT-H/14 (full) | 986.11 | 1.8 GB | 3.7 GB |
| CLIP ViT-B/16 (full) | 149.62 | 285.4 MB | 570.8 MB |
| SigLIP ViT-B/16-384 (full) | 203.45 | 388.0 MB | 776.1 MB |

Note: Our R@K is lower because we never train on image–text pairs. We only distill DINOv3 image features into CLIP's image space with a tiny MLP, while the baselines (CLIP/SigLIP) are trained end-to-end on massive paired data.

Segmentation


What it is

A small segmentation head over DINOv3 patch tokens, plus CLIP-guided text fusion at inference for open-vocab behavior. We ship a Cityscapes-trained checkpoint. Prompts can highlight the classes you want.

  • Backbone: DINOv3 (frozen at infer)
  • Adapter: Pixel-level adapter to CLIP space
  • Head: Multi-branch BN head + upsampling
  • Text fusion: cosine similarity blended with supervised logits

Quick start

Single image (no prompt):

```shell
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/seg_leverkusen_000055.png
```

Single image, prompt-filtered (highlight cars only):

```shell
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/car.png \
  --prompt "car"
```

Directory (recursively process all images):

```shell
dinov3clip seg \
  /path/to/input/dir \
  --ckpt /path/to/checkpoint
```

Note: --prompt accepts comma-separated class names, e.g. --prompt "car, bus, person". Under the hood we match prompts against the Cityscapes labels, fuse the text logits with the supervised logits, collapse the non-selected classes, and overlay only the wanted classes. This gives a clean highlight for the chosen categories instead of coloring everything. If you want text-only fusion, set the alpha in code to 1.0.
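The fusion described in the note above can be sketched in a few lines of numpy (the names, class list, and `alpha` blend are illustrative assumptions, not the repo's exact code):

```python
import numpy as np

CLASSES = ["road", "car", "bus", "person"]  # toy subset of Cityscapes labels

def prompt_filtered_logits(sup_logits, text_logits, selected, alpha=0.5):
    """Blend supervised logits with CLIP text logits, then suppress every
    class that was not named in the prompt.
    sup_logits, text_logits: (C, H, W); selected: list of class names."""
    fused = (1.0 - alpha) * sup_logits + alpha * text_logits
    keep = np.array([c in selected for c in CLASSES])
    fused[~keep] = -np.inf  # collapse non-selected classes
    return fused

# toy maps: 4 classes over a 2x2 image
rng = np.random.default_rng(0)
sup = rng.normal(size=(4, 2, 2))
txt = rng.normal(size=(4, 2, 2))
fused = prompt_filtered_logits(sup, txt, selected=["car"], alpha=1.0)  # text-only
pred = fused.argmax(axis=0)
print(pred)  # every pixel resolves to class 1 ("car"), the only kept class
```

With `alpha=1.0` the supervised logits are ignored entirely, matching the "text only" setting mentioned in the note.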

Model size (footprint)

| Pipeline | Params (M) | fp16 | fp32 |
| --- | --- | --- | --- |
| Seg-Only | 97.66 | 186.3 MB | 372.6 MB |
| Dinov3clip (full) | 525.28 | 1001.9 MB | 2003.8 MB |
| CLIP ViT-L/14 (full) | 427.62 | 815.6 MB | 1631.2 MB |

How it works

  • DINOv3 is a self-supervised vision foundation model (ViT family). It takes an image and outputs a global embedding (plus dense tokens), no text encoder included.

  • CLIP is a contrastive image–text model. We only use its text tower at inference and its image tower as a teacher during training.

  • We freeze both DINOv3 and CLIP, and train a small MLP adapter so that adapter(DINOv3(img)) matches CLIP_image(img). After training, the adapted image embedding is compared to CLIP text embeddings for zero-shot retrieval.
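A minimal sketch of that distillation step, with stand-in random features in place of the real frozen backbones (the adapter widths and the cosine objective here are illustrative assumptions, not the repo's exact training code):

```python
import torch
import torch.nn as nn

# stand-ins for frozen-backbone outputs: DINOv3-B/16 emits 768-d embeddings,
# and the CLIP ViT-L/14 image space targeted here is also 768-d
dino_feats = torch.randn(64, 768)  # pretend: DINOv3(img), frozen
teacher = nn.Linear(768, 768)      # pretend: CLIP_image(img), frozen
with torch.no_grad():
    clip_feats = teacher(dino_feats)

adapter = nn.Sequential(           # the only trainable piece
    nn.Linear(768, 1024), nn.GELU(), nn.Linear(1024, 768))
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

def cosine_distill_loss(pred, target):
    # push adapter(DINOv3(img)) toward CLIP_image(img) directionally
    return 1.0 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()

losses = []
for _ in range(50):
    opt.zero_grad()
    loss = cosine_distill_loss(adapter(dino_feats), clip_feats)
    loss.backward()
    opt.step()
    losses.append(loss.item())
print(losses[0], losses[-1])  # loss drops as the adapter aligns with the teacher
```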

Results

Closed-set results (single-scale unless noted)

| Method | Backbone | Dataset | mIoU | fwIoU | PixelAcc | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| DINOv3-CLIP (ours) | DINOv3-B/16 + Adapter + BNHead | CamVid-11 (test) | 0.707 | 0.873 | 0.930 | min-side=896, EMA, closed-set head |
| Dilation8 | VGG-16 | CamVid-11 (test) | 0.653 | – | – | dilated CNN |
| BiSeNet | ResNet-18 | CamVid-11 (test) | 0.687 | – | – | real-time baseline |
| VideoGCRF | ResNet-101 | CamVid-11 (test) | 0.752 | – | – | spatio-temporal CRF |
| RTFormer-Base | RTFormer | CamVid-11 (test) | 0.825 | – | – | larger variant |
| SegFormer-B5 | MiT-B5 | CamVid-11 (test) | 0.837 | – | – | transformer encoder–decoder |
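The mIoU / fwIoU / PixelAcc columns follow the standard confusion-matrix definitions. A quick numpy sketch (toy label maps, not the repo's evaluation code):

```python
import numpy as np

def seg_metrics(pred, gt, num_classes):
    """Segmentation metrics from a confusion matrix.
    pred, gt: integer label maps of the same shape."""
    cm = np.bincount(num_classes * gt.ravel() + pred.ravel(),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    union = cm.sum(0) + cm.sum(1) - tp          # pred count + gt count - TP
    iou = tp / np.maximum(union, 1)
    freq = cm.sum(1) / cm.sum()                 # class frequency in ground truth
    return {"mIoU": iou.mean(),
            "fwIoU": (freq * iou).sum(),
            "PixelAcc": tp.sum() / cm.sum()}

# toy 2x2 image with 2 classes: one pixel mislabeled
gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
m = seg_metrics(pred, gt, num_classes=2)
print(m)  # PixelAcc 0.75; per-class IoU 0.5 and 2/3
```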

NOTE, PLEASE READ: This is NOT a pure segmentation model. No augmentation was used during training and we made zero architectural changes (adapter + BNHead only); the backbones stay frozen.

Performance & caveats

  • GPU strongly recommended. DINOv3 + CLIP text on CPU is slow. If CUDA isn’t available, we fall back to CPU.

  • This adapter is trained from CLIP image features only, with NO text during training. Labels that are far from your training image domain can still get pretty high scores (because CLIP's text space is broad). The ranking, however, is usually sensible for in-domain images.

  • On single images with a very long label list, softmax+low temperature often yields easier-to-read overlays.

  • Why this isn’t apples-to-apples with big seg models:

    1. Backbones: we keep DINOv3 and CLIP frozen. Most SOTA seg models train the whole backbone.
    2. Head size / schedule: small BN head + ~30 epochs vs. heavy decoders and long schedules (100–160k iters).
    3. Augmentations: None
    4. Objective: we also support promptable / open-vocab behavior (text fusion). Classic seg models are closed-set only.

Troubleshooting

  1. 403/401 on DINOv3 downloads: your HF token needs the "public gated repositories" permission and you must accept DINOv3's terms on the model card. Set HF_TOKEN in the env.

  2. model type dinov3_vit not recognized: upgrade transformers and load with trust_remote_code=True.

  3. Slow training -> confirm GPU util; raise batch_size until VRAM is utilized, increase workers/prefetch, and ensure TF32 is enabled.

  4. pydensecrf (DenseCRF post-processing) must be installed from source: `pip install git+https://github.com/lucasb-eyer/pydensecrf.git@master`
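For the slow-training item above, TF32 can be enabled explicitly in PyTorch (a config fragment; it only takes effect on Ampere or newer GPUs):

```python
import torch

# allow TF32 for matmuls and cuDNN kernels: much faster on Ampere+,
# with negligible accuracy impact for this workload
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```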

License & weights

  • Dinov3-CLIP Adapter: Apache 2.0

  • DINOv3 weights: released under the DINOv3 License by Meta; the weights are gated and generally should not be redistributed. Request/download them via Hugging Face or Meta's portal.

  • OpenCLIP/CLIP: follow their respective licenses.

  • Datasets used for training:

    • Cityscapes — dataset under the Cityscapes license (non-commercial research; additional restrictions apply). You must obtain access and accept their terms.

    • COCO — annotations released under a permissive license; images are provided by their owners under a variety of Creative Commons licenses. Use requires compliance with COCO's terms.

If you redistribute models trained on these datasets, ensure your use complies with those dataset licenses.

To-do

- [x] Segmentation head: add a light decoder over DINOv3 tokens and train with pseudo-labels
- [ ] Better calibration: learn a small temp or bias layer for more probability-like scores
- [ ] Prompt ensembling: expand / learn templates per class
- [ ] Train with more pictures for segmentation
- [ ] Train with new architecture

Related work

Alignment between self-supervised vision backbones (like DINO/DINOv2) and language (CLIP) via learned mappings has appeared in the literature (e.g., Talking to DINO for open-vocabulary segmentation), which is conceptually similar to this adapter idea.

A note on expectations

The adapter never sees text during training. It distills DINOv3 image features into CLIP's image space only. That makes it compact and simple to train. However, the absolute retrieval numbers trail end-to-end CLIP/SigLIP. For highlighting semantics (retrieval, quick tagging, and prompt-filtered segmentation) it is practical and easy to deploy.
