This repo contains two parts:
- Adapter — a tiny MLP that maps frozen DINOv3 image embeddings into CLIP image space so they can be compared against CLIP text embeddings. Use it for zero-shot retrieval, scoring, and open-vocab labeling.
- Segmentation — a lightweight decoder over DINOv3 tokens plus CLIP-guided text fusion for open-vocab semantic segmentation. Works on street scenes and can be prompt-filtered (e.g., `--prompt "car"`).
Legend: ✅ yes | ⚠️ partial | ❌ no
| Method | Supervision used for training | Promptable | Open-vocab (unseen) | Closed-set | Needs text enc | Frozen backbone | Intended use | Notes |
|---|---|---|---|---|---|---|---|---|
| DINOv3-CLIP (ours) | Pixel masks + CLIP-image teacher (distill) | ✅ | ✅ | ✅ | ✅ | ✅ | Lightweight, quick-deploy, small head | Frozen DINOv3 + tiny adapter + BN head; ~30 epochs; no augmentation in this run |
| DeepLabv3+ | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Strong closed-set baseline | Trains backbone + decoder |
| SegFormer | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Efficient closed-set encoder–decoder | Transformer encoder |
| HRNet-OCR | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | High-quality closed-set | High-res features; strong mIoU with heavy training |
| BiSeNet-V2 | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Real-time closed-set | Very fast |
| Mask2Former | Pixel masks (+ often extra data) | ❌ | ❌ | ✅ | ❌ | ❌ | SOTA-ish closed-set with mask decoder | Query-based mask decoding |
| MaskDINO | Pixel masks (+ extra) | ❌ | ❌ | ✅ | ❌ | ❌ | SOTA-ish closed-set | DETR-style with mask queries |
| LSeg | CLIP alignment + pixel masks | ✅ | ✅ | ✅ | ✅ | ❌ | Open-vocab semantic seg | Text-aligned features |
| OpenSeg | CLIP/ALIGN features + masks/pseudo | ✅ | ✅ | ✅ | ✅ | ❌ | Open-vocab semantic seg at scale | Trained with large text supervision; strong zero-shot |
| MaskCLIP | CLIP + pseudo-masks | ✅ | ✅ | ✅ | ✅ | ✅/⚠️ | Prompt-driven zero-/few-shot seg | Often freezes CLIP |
| TCL / Talking-to-CLIP/DINO | Align SSL→CLIP (language guidance) | ✅ | ✅ | — | ✅ | ✅ | Language-guided dense prediction | Bridges SSL (DINO) to text space |
```bash
python -m venv venv && source venv/bin/activate
pip install -e .
pip install torch torchvision transformers open-clip-torch huggingface_hub safetensors tqdm Pillow pyyaml
```

Note: use a recent `transformers` and pass `trust_remote_code=True` when loading DINOv3. DINOv3 models on HF are gated; request access and make sure your token allows public gated repos.
- Frozen DINOv3 (image in -> embedding out) + a trainable MLP adapter that maps to CLIP image space; the CLIP text tower is used as-is at inference
- This is NOT a new CLIP model and not DINOv3 fine-tuning; we did not train DINOv3 or CLIP
- Adds zero-shot text retrieval to DINOv3 without caption supervision
- Trained with images only (no image–text pairs), so retrieval lags fully contrastively trained CLIP/SigLIP
- Out of the box, this works with DINOv3 checkpoints from Hugging Face (gated) via `transformers` with `trust_remote_code=True`
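The zero-shot scoring step can be sketched in plain NumPy: L2-normalize the adapter output and the CLIP text embeddings, take cosine similarities, and (optionally) softmax them with a low temperature to rank the prompts. Shapes and the temperature value below are illustrative, not the repo's exact defaults.

```python
import numpy as np

def zero_shot_scores(adapted_img, text_embs, temp=0.02):
    """Rank text prompts against one adapted image embedding.

    adapted_img: (D,)   output of adapter(DINOv3(image))
    text_embs:   (K, D) CLIP text embeddings, one per prompt
    """
    img = adapted_img / np.linalg.norm(adapted_img)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarities, shape (K,)
    logits = sims / temp                   # temperature sharpening
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# toy example: 3 prompts, 8-dim embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=8)
texts = rng.normal(size=(3, 8))
p = zero_shot_scores(img, texts)
```

The highest-probability entry in `p` corresponds to the best-matching prompt.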
```bash
# wget
wget -O adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt

# curl
curl -L -o adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt

# git lfs
git lfs install
git clone https://huggingface.co/duriantaco/dinov3clip
cp dinov3clip/adapter.pt .
```

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="duriantaco/dinov3clip", filename="adapter.pt")
print("Saved to:", ckpt_path)
```

Note: users don't need a token for the adapter itself, but they do need HF access (and `HF_TOKEN`) for the DINOv3 backbone weights at inference/training time.
- Inference with free-form prompts

Compare one image to a few natural-language hypotheses:

```bash
dinov3clip infer \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/image.jpg \
  --texts "street with cars" "a person" "a dog" \
  --device cuda --topk 3 --softmax --temp 0.02 \
  --out out_infer.png  # optional overlay
```

- Open-vocab labelling

```bash
dinov3clip label \
  --ckpt /data_1/dinov3clip/adapter.pt \
  --image /path/to/img.jpg \
  --device cuda --topk 5 --softmax --temp 0.02 \
  --out out.png
```

- Text -> image retrieval grid:

```bash
python scripts/viz_text_to_image_grid.py \
  --ckpt checkpoints/adapter.pt \
  --images /path/to/images \
  --query "a city street with cars" \
  --topk 8 --cols 4 \
  --out assets/t2i_grid.png
```

- Image -> top-k prompt scores:

```bash
python scripts/viz_image_topk.py \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/images \
  --texts "a city street with cars" "a dog on grass" "a person skiing" "a red bus" "night street" \
  --out assets/image_topk.png
```

Ours = DINOv3-B/16 (image) + Adapter (3.15M) + CLIP ViT-L/14 text (laion2b). We trained the adapter to match CLIP ViT-L/14 image space, and at inference we only use the text tower from that same CLIP.
| Model / Pipeline | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 | Mean R@1 |
|---|---|---|---|---|---|---|---|
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text (laion2b) | 41.66 | 71.04 | 81.40 | 33.66 | 61.24 | 72.37 | 37.66 |
| CLIP ViT-L/14 (openai, full image+text) | 56.94 | 79.62 | 86.74 | 35.70 | 60.43 | 70.49 | 46.32 |
| OpenCLIP ViT-H/14 (laion2b, full) | 66.18 | 86.54 | 91.94 | 48.50 | 72.78 | 81.06 | 57.34 |
| CLIP ViT-B/16 (openai, full) | 49.28 | 73.44 | 81.84 | 30.18 | 54.43 | 65.32 | 39.73 |
| SigLIP ViT-B/16-384 (webli, full) | 67.62 | 87.98 | 92.76 | 49.40 | 73.78 | 81.95 | 58.51 |
| Pipeline | Params (M) | fp16 | fp32 |
|---|---|---|---|
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text | 212.46 | 405.2 MB | 810.5 MB |
| CLIP ViT-L/14 (full image+text) | 427.62 | 815.6 MB | 1.6 GB |
| OpenCLIP ViT-H/14 (full) | 986.11 | 1.8 GB | 3.7 GB |
| CLIP ViT-B/16 (full) | 149.62 | 285.4 MB | 570.8 MB |
| SigLIP ViT-B/16-384 (full) | 203.45 | 388.0 MB | 776.1 MB |
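The fp16/fp32 sizes in these tables follow directly from the parameter counts: 2 bytes per parameter for fp16, 4 for fp32, reported in MiB (labelled "MB" above). A quick sanity check:

```python
def checkpoint_size_mb(params_millions, bytes_per_param):
    """Checkpoint size in MiB: params * bytes-per-param / 2**20."""
    return params_millions * 1e6 * bytes_per_param / 2**20

# Ours: 212.46M params total (frozen DINOv3 + adapter + CLIP text tower)
fp16 = round(checkpoint_size_mb(212.46, 2), 1)  # -> 405.2, matches the table
fp32 = round(checkpoint_size_mb(212.46, 4), 1)  # -> 810.5, matches the table
print(fp16, fp32)
```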
Note: Our R@K is lower because we never train on image–text pairs. We only distill DINOv3 image features into CLIP's image space with a tiny MLP. Baselines (CLIP/SigLIP) are trained end to end on massive paired data.
A small segmentation head over DINOv3 patch tokens, plus CLIP-guided text fusion at inference for open-vocab behavior. We ship a Cityscapes-trained checkpoint. Prompts can highlight the classes you want.
- Backbone: DINOv3 (frozen at infer)
- Adapter: Pixel-level adapter to CLIP space
- Head: Multi-branch BN head + upsampling
- Text fusion: cosine similarity blended with supervised logits
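The text-fusion step above can be sketched as a per-pixel blend: cosine similarities between pixel features (adapted to CLIP space) and class-name text embeddings are mixed with the supervised head's logits. The `alpha` weight here is an assumption for illustration (the note below on `--prompt` mentions setting it to 1.0 for text-only behavior); the repo's exact fusion may differ.

```python
import numpy as np

def fuse_logits(sup_logits, pix_feats, text_embs, alpha=0.5):
    """Blend supervised logits with CLIP-text similarity logits.

    sup_logits: (H, W, C) logits from the trained seg head
    pix_feats:  (H, W, D) per-pixel features adapted to CLIP space
    text_embs:  (C, D)    CLIP text embeddings, one per class name
    alpha:      weight on the text branch (1.0 = text only)
    """
    f = pix_feats / np.linalg.norm(pix_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    text_logits = f @ t.T                      # (H, W, C) cosine similarities
    return (1 - alpha) * sup_logits + alpha * text_logits

# toy shapes: 2x2 image, 3 classes, 8-dim features
rng = np.random.default_rng(0)
sup = rng.normal(size=(2, 2, 3))
feats = rng.normal(size=(2, 2, 8))
texts = rng.normal(size=(3, 8))
fused = fuse_logits(sup, feats, texts, alpha=0.5)
```

With `alpha=0.0` the fusion reduces to the supervised logits; with `alpha=1.0` it is purely text-driven.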
```bash
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/seg_leverkusen_000055.png
```

```bash
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/car.png \
  --prompt "car"
```

```bash
dinov3clip seg \
  /path/to/input/dir \
  --ckpt /path/to/checkpoint
```

Note: `--prompt` accepts comma-separated class names, e.g. `--prompt "car, bus, person"`. We match prompts against the Cityscapes labels under the hood, fuse text logits with supervised logits, collapse non-selected classes, and overlay only the wanted classes. This gives a clean highlight for the chosen categories instead of coloring everything. If you want text-only, set the alpha in code to 1.0.
| Pipeline | Params (M) | fp16 | fp32 |
|---|---|---|---|
| Seg-Only | 97.66 | 186.3 MB | 372.6 MB |
| Dinov3clip (full) | 525.28 | 1001.9 MB | 2003.8 MB |
| CLIP ViT-L/14 (full) | 427.62 | 815.6 MB | 1631.2 MB |
- DINOv3 is a self-supervised vision foundation model (ViT family). It takes an image and outputs a global embedding (plus dense tokens); no text encoder is included.
- CLIP is a contrastive image–text model. We only use its text tower at inference and its image tower as a teacher during training.
- We freeze both DINOv3 and CLIP and train a small MLP adapter so that adapter(DINOv3(img)) matches CLIP_image(img). After training, compare the adapted image embedding to CLIP text embeddings for zero-shot retrieval.
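The training objective is straightforward feature distillation; a minimal sketch using a cosine-distillation loss (the repo's exact loss may differ, e.g. it could use MSE or a combination):

```python
import numpy as np

def cosine_distill_loss(adapted, teacher):
    """1 - cosine similarity, averaged over the batch.

    adapted: (B, D) = adapter(DINOv3(img)), trainable branch
    teacher: (B, D) = CLIP_image(img), frozen target
    """
    a = adapted / np.linalg.norm(adapted, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * t, axis=1)))

x = np.random.default_rng(0).normal(size=(4, 16))
print(cosine_distill_loss(x, x))   # identical features -> loss ≈ 0
print(cosine_distill_loss(x, -x))  # opposite features  -> loss ≈ 2
```

Driving this loss to zero means the adapted DINOv3 embedding points in the same direction as the CLIP image embedding, so CLIP text embeddings become directly comparable.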
| Method | Backbone | Dataset | mIoU | fwIoU | PixelAcc | Notes |
|---|---|---|---|---|---|---|
| DINOv3-CLIP (ours) | DINOv3-B/16 + Adapter + BNHead | CamVid-11 (test) | 0.707 | 0.873 | 0.930 | min-side=896, EMA, closed-set head |
| Dilation8 | VGG-16 | CamVid-11 (test) | 0.653 | — | — | dilated CNN |
| BiSeNet | ResNet-18 | CamVid-11 (test) | 0.687 | — | — | real time baseline |
| VideoGCRF | ResNet-101 | CamVid-11 (test) | 0.752 | — | — | spatio-temporal CRF |
| RTFormer-Base | RTFormer | CamVid-11 (test) | 0.825 | — | — | larger variant |
| SegFormer-B5 | MiT-B5 | CamVid-11 (test) | 0.837 | — | — | transformer encoder–decoder |
NOTE PLEASE READ: This is NOT a pure segmentation model. Training uses no augmentation and we made zero architectural changes (adapter + BNHead only); backbones are frozen.
- GPU strongly recommended. DINOv3 + CLIP text on CPU is slow. If CUDA isn't available, we fall back to CPU.
- This adapter is trained from CLIP image features only; no text is seen during training. Labels far from your training image domain can still score high (because CLIP's text space is broad), but the ranking is usually sensible for in-domain images.
- On single images with a very long label list, softmax + low temperature often yields easier-to-read overlays.
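Why a low temperature helps: cosine scores tend to cluster in a narrow band, and dividing by a small temperature before the softmax spreads them into a clearly peaked distribution. The scores below are made up for illustration:

```python
import numpy as np

def softmax_with_temp(sims, temp):
    """Temperature-scaled softmax over prompt similarities."""
    z = sims / temp
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

sims = np.array([0.31, 0.28, 0.27])  # typical cosines: tightly clustered
p_flat = softmax_with_temp(sims, 1.0)    # nearly uniform, hard to read
p_sharp = softmax_with_temp(sims, 0.02)  # one clear winner
print(p_flat, p_sharp)
```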
- Why this isn't apples-to-apples with big seg models:
  - Backbones: we keep DINOv3 and CLIP frozen; most SOTA seg models train the whole backbone.
  - Head size / schedule: small BN head + ~30 epochs vs. heavy decoders and long schedules (100–160k iters).
  - Augmentations: none.
  - Objective: we also support promptable / open-vocab behavior (text fusion); classic seg models are closed-set only.
- 403/401 on DINOv3 downloads: your HF token needs the "public gated repositories" permission and you must accept DINOv3's terms on the model card. Set `HF_TOKEN` in the env.
- `model type dinov3_vit not recognized`: upgrade `transformers` and load with `trust_remote_code=True`.
- Slow training: confirm GPU utilization; raise `batch_size` until VRAM is used, increase workers/prefetch, and ensure TF32 is enabled.
- If you need `pydensecrf`: `pip install git+https://github.com/lucasb-eyer/pydensecrf.git@master`
- Dinov3-CLIP Adapter: Apache 2.0
- DINOv3 weights: released under the DINOv3 License by Meta; weights are gated and you generally should not redistribute them. Request/download them via Hugging Face or Meta's portal.
- OpenCLIP/CLIP: follow their respective licenses.
- Datasets used for training:
  - Cityscapes — under the Cityscapes license (non-commercial research; additional restrictions apply). You must obtain access and accept their terms.
  - COCO — annotations released under a permissive license; images are provided by their owners under a variety of Creative Commons licenses. Use requires compliance with COCO's terms.
- If you redistribute models trained on these datasets, ensure your use complies with those dataset licenses.
- [x] Segmentation head: add a light decoder over DINOv3 tokens and train with pseudo-labels
- [ ] Better calibration: learn a small temp or bias layer for more probability-like scores
- [ ] Prompt ensembling: expand / learn templates per class
- [ ] Train with more pictures for segmentation
- [ ] Train with a new architecture
Alignment between self-supervised vision backbones (like DINO/DINOv2) and language (CLIP) via learned mappings has appeared in the literature (e.g., Talking to DINO for open-vocabulary segmentation), which is conceptually similar to this adapter idea.
The adapter never sees text during training. It distills DINOv3 image features into CLIP's image space only. That makes it compact and simple to train. However, the absolute retrieval numbers trail end-to-end CLIP/SigLIP. For highlighting semantics (retrieval, quick tagging, and prompt-filtered segmentation) it is practical and easy to deploy.

