This repo contains two parts:
- Adapter — a tiny MLP that maps frozen DINOv3 image embeddings into CLIP image space so they can be compared against CLIP text embeddings. Use it for zero-shot retrieval, scoring, and open-vocab labeling.
- Segmentation — a lightweight decoder over DINOv3 tokens plus CLIP-guided text fusion for open-vocab semantic segmentation. Works on street scenes and can be prompt-filtered (e.g., `--prompt "car"`).
Legend: ✅ yes | ⚠️ partial | ❌ no
| Method | Supervision used for training | Promptable | Open-vocab (unseen) | Closed-set | Needs text enc | Frozen backbone | Intended use | Notes |
|---|---|---|---|---|---|---|---|---|
| DINOv3-CLIP (ours) | Pixel masks + CLIP-image teacher (distill) | ✅ | ✅ | ✅ | ✅ | ✅ | Lightweight, quick-deploy, small head | Frozen DINOv3 + tiny adapter + BN head; ~30 epochs; no augmentation in this run |
| DeepLabv3+ | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Strong closed-set baseline | Trains backbone + decoder |
| SegFormer | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Efficient closed-set encoder–decoder | Transformer encoder |
| HRNet-OCR | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | High-quality closed-set | High-res features; strong mIoU with heavy training |
| BiSeNet-V2 | Pixel masks | ❌ | ❌ | ✅ | ❌ | ❌ | Real-time closed-set | Very fast |
| Mask2Former | Pixel masks (+ often extra data) | ❌ | ❌ | ✅ | ❌ | ❌ | SOTA-ish closed-set with mask decoder | Query-based mask decoding |
| MaskDINO | Pixel masks (+ extra) | ❌ | ❌ | ✅ | ❌ | ❌ | SOTA-ish closed-set | DETR-style with mask queries |
| LSeg | CLIP alignment + pixel masks | ✅ | ✅ | ✅ | ✅ | ❌ | Open-vocab semantic seg | Text-aligned features |
| OpenSeg | CLIP/ALIGN features + masks/pseudo | ✅ | ✅ | ✅ | ✅ | ❌ | Open-vocab semantic seg at scale | Trained with large text supervision; strong zero-shot |
| MaskCLIP | CLIP + pseudo-masks | ✅ | ✅ | ✅ | ✅ | ✅/⚠️ | Prompt-driven zero-/few-shot seg | Often freezes CLIP |
| TCL / Talking-to-CLIP/DINO | Align SSL→CLIP (language guidance) | ✅ | ✅ | — | ✅ | ✅ | Language-guided dense prediction | Bridges SSL (DINO) to text space |
```bash
python -m venv venv && source venv/bin/activate
pip install -e .
pip install torch torchvision transformers open-clip-torch huggingface_hub safetensors tqdm Pillow pyyaml
```

Note: use a recent `transformers` and pass `trust_remote_code=True` when loading DINOv3. DINOv3 models on HF are gated; request access and make sure your token allows public gated repos.
- Frozen DINOv3 (image in -> embedding out) + a trainable MLP adapter that maps to CLIP image space; the CLIP text tower is used as-is at inference
- This is NOT a new CLIP model and not DINOv3 fine-tuning; we did not train DINOv3 or CLIP
- Adds zero-shot text retrieval to DINOv3 without caption supervision
- Trained with images only (no image–text pairs), so retrieval lags fully contrastively trained CLIP/SigLIP
- Out of the box, this works with DINOv3 checkpoints from Hugging Face (gated) via `transformers` with `trust_remote_code=True`
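The zero-shot scoring step can be sketched in plain NumPy: L2-normalize the adapter output and the CLIP text embeddings, take cosine similarities, and (optionally) softmax them with a low temperature to rank the prompts. Shapes and the temperature value below are illustrative, not the repo's exact defaults.

```python
import numpy as np

def zero_shot_scores(adapted_img, text_embs, temp=0.02):
    """Rank text prompts against one adapted image embedding.

    adapted_img: (D,)   output of adapter(DINOv3(image))
    text_embs:   (K, D) CLIP text embeddings, one per prompt
    """
    img = adapted_img / np.linalg.norm(adapted_img)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarities, shape (K,)
    logits = sims / temp                   # temperature sharpening
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# toy example: 3 prompts, 8-dim embeddings
rng = np.random.default_rng(0)
img = rng.normal(size=8)
texts = rng.normal(size=(3, 8))
p = zero_shot_scores(img, texts)
```

The highest-probability entry in `p` corresponds to the best-matching prompt.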
```bash
# wget
wget -O adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt

# curl
curl -L -o adapter.pt https://huggingface.co/duriantaco/dinov3clip/resolve/main/adapter.pt

# git lfs
git lfs install
git clone https://huggingface.co/duriantaco/dinov3clip
cp dinov3clip/adapter.pt .
```

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="duriantaco/dinov3clip", filename="adapter.pt")
print("Saved to:", ckpt_path)
```

Note: users don't need a token for the adapter itself, but they do need HF access (and `HF_TOKEN`) for the DINOv3 backbone weights at inference/training time.
- Inference with free-form prompts

Compare one image to a few natural-language hypotheses:

```bash
dinov3clip infer \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/image.jpg \
  --texts "street with cars" "a person" "a dog" \
  --device cuda --topk 3 --softmax --temp 0.02 \
  --out out_infer.png  # optional overlay
```

- Open-vocab labelling

```bash
dinov3clip label \
  --ckpt /data_1/dinov3clip/adapter.pt \
  --image /path/to/img.jpg \
  --device cuda --topk 5 --softmax --temp 0.02 \
  --out out.png
```

- Text -> image retrieval grid:

```bash
python scripts/viz_text_to_image_grid.py \
  --ckpt checkpoints/adapter.pt \
  --images /path/to/images \
  --query "a city street with cars" \
  --topk 8 --cols 4 \
  --out assets/t2i_grid.png
```

- Image -> top-k prompt scores:

```bash
python scripts/viz_image_topk.py \
  --ckpt checkpoints/adapter.pt \
  --image /path/to/images \
  --texts "a city street with cars" "a dog on grass" "a person skiing" "a red bus" "night street" \
  --out assets/image_topk.png
```

Ours = DINOv3-B/16 (image) + Adapter (3.15M) + CLIP ViT-L/14 text (laion2b). We trained the adapter to match CLIP ViT-L/14 image space, and at inference we only use the text tower from that same CLIP.
| Model / Pipeline | i2t R@1 | i2t R@5 | i2t R@10 | t2i R@1 | t2i R@5 | t2i R@10 | Mean R@1 |
|---|---|---|---|---|---|---|---|
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text (laion2b) | 41.66 | 71.04 | 81.40 | 33.66 | 61.24 | 72.37 | 37.66 |
| CLIP ViT-L/14 (openai, full image+text) | 56.94 | 79.62 | 86.74 | 35.70 | 60.43 | 70.49 | 46.32 |
| OpenCLIP ViT-H/14 (laion2b, full) | 66.18 | 86.54 | 91.94 | 48.50 | 72.78 | 81.06 | 57.34 |
| CLIP ViT-B/16 (openai, full) | 49.28 | 73.44 | 81.84 | 30.18 | 54.43 | 65.32 | 39.73 |
| SigLIP ViT-B/16-384 (webli, full) | 67.62 | 87.98 | 92.76 | 49.40 | 73.78 | 81.95 | 58.51 |
| Pipeline | Params (M) | fp16 | fp32 |
|---|---|---|---|
| Ours: DINOv3-B/16 + Adapter + CLIP L/14 text | 212.46 | 405.2 MB | 810.5 MB |
| CLIP ViT-L/14 (full image+text) | 427.62 | 815.6 MB | 1.6 GB |
| OpenCLIP ViT-H/14 (full) | 986.11 | 1.8 GB | 3.7 GB |
| CLIP ViT-B/16 (full) | 149.62 | 285.4 MB | 570.8 MB |
| SigLIP ViT-B/16-384 (full) | 203.45 | 388.0 MB | 776.1 MB |
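The fp16/fp32 sizes in these tables follow directly from the parameter counts: 2 bytes per parameter for fp16, 4 for fp32, reported in MiB (labelled "MB" above). A quick sanity check:

```python
def checkpoint_size_mb(params_millions, bytes_per_param):
    """Checkpoint size in MiB: params * bytes-per-param / 2**20."""
    return params_millions * 1e6 * bytes_per_param / 2**20

# Ours: 212.46M params total (frozen DINOv3 + adapter + CLIP text tower)
fp16 = round(checkpoint_size_mb(212.46, 2), 1)  # -> 405.2, matches the table
fp32 = round(checkpoint_size_mb(212.46, 4), 1)  # -> 810.5, matches the table
print(fp16, fp32)
```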
Note: Our R@K is lower because we never train on image–text pairs. We only distill DINOv3 image features into CLIP's image space with a tiny MLP. Baselines (CLIP/SigLIP) are trained end to end on massive paired data.
A small segmentation head over DINOv3 patch tokens, plus CLIP-guided text fusion at inference for open-vocab behavior. We ship a Cityscapes-trained checkpoint. Prompts can highlight the classes you want.
- Backbone: DINOv3 (frozen at infer)
- Adapter: Pixel-level adapter to CLIP space
- Head: Multi-branch BN head + upsampling
- Text fusion: cosine similarity blended with supervised logits
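The text-fusion step above can be sketched as a per-pixel blend: cosine similarities between pixel features (adapted to CLIP space) and class-name text embeddings are mixed with the supervised head's logits. The `alpha` weight here is an assumption for illustration (the note below on `--prompt` mentions setting it to 1.0 for text-only behavior); the repo's exact fusion may differ.

```python
import numpy as np

def fuse_logits(sup_logits, pix_feats, text_embs, alpha=0.5):
    """Blend supervised logits with CLIP-text similarity logits.

    sup_logits: (H, W, C) logits from the trained seg head
    pix_feats:  (H, W, D) per-pixel features adapted to CLIP space
    text_embs:  (C, D)    CLIP text embeddings, one per class name
    alpha:      weight on the text branch (1.0 = text only)
    """
    f = pix_feats / np.linalg.norm(pix_feats, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    text_logits = f @ t.T                      # (H, W, C) cosine similarities
    return (1 - alpha) * sup_logits + alpha * text_logits

# toy shapes: 2x2 image, 3 classes, 8-dim features
rng = np.random.default_rng(0)
sup = rng.normal(size=(2, 2, 3))
feats = rng.normal(size=(2, 2, 8))
texts = rng.normal(size=(3, 8))
fused = fuse_logits(sup, feats, texts, alpha=0.5)
```

With `alpha=0.0` the fusion reduces to the supervised logits; with `alpha=1.0` it is purely text-driven.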
```bash
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/seg_leverkusen_000055.png
```

```bash
dinov3clip seg \
  /path/to/input/image \
  --ckpt /path/to/checkpoint \
  --out out/car.png \
  --prompt "car"
```

```bash
dinov3clip seg \
  /path/to/input/dir \
  --ckpt /path/to/checkpoint
```

Note: `--prompt` accepts comma-separated class names, e.g. `--prompt "car, bus, person"`. We match prompts against the Cityscapes labels under the hood, fuse text logits with supervised logits, collapse non-selected classes, and overlay only the wanted classes. This gives a clean highlight for the chosen categories instead of coloring everything. If you want text-only, set the alpha in code to 1.0.
| Pipeline | Params (M) | fp16 | fp32 |
|---|---|---|---|
| Seg-Only | 97.66 | 186.3 MB | 372.6 MB |
| Dinov3clip (full) | 525.28 | 1001.9 MB | 2003.8 MB |
| CLIP ViT-L/14 (full) | 427.62 | 815.6 MB | 1631.2 MB |
- DINOv3 is a self-supervised vision foundation model (ViT family). It takes an image and outputs a global embedding (plus dense tokens); no text encoder is included.
- CLIP is a contrastive image–text model. We only use its text tower at inference and its image tower as a teacher during training.
- We freeze both DINOv3 and CLIP and train a small MLP adapter so that adapter(DINOv3(img)) matches CLIP_image(img). After training, compare the adapted image embedding to CLIP text embeddings for zero-shot retrieval.
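The training objective is straightforward feature distillation; a minimal sketch using a cosine-distillation loss (the repo's exact loss may differ, e.g. it could use MSE or a combination):

```python
import numpy as np

def cosine_distill_loss(adapted, teacher):
    """1 - cosine similarity, averaged over the batch.

    adapted: (B, D) = adapter(DINOv3(img)), trainable branch
    teacher: (B, D) = CLIP_image(img), frozen target
    """
    a = adapted / np.linalg.norm(adapted, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * t, axis=1)))

x = np.random.default_rng(0).normal(size=(4, 16))
print(cosine_distill_loss(x, x))   # identical features -> loss ≈ 0
print(cosine_distill_loss(x, -x))  # opposite features  -> loss ≈ 2
```

Driving this loss to zero means the adapted DINOv3 embedding points in the same direction as the CLIP image embedding, so CLIP text embeddings become directly comparable.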
| Method | Backbone | Dataset | mIoU | fwIoU | PixelAcc | Notes |
|---|---|---|---|---|---|---|
| DINOv3-CLIP (ours) | DINOv3-B/16 + Adapter + BNHead | CamVid-11 (test) | 0.707 | 0.873 | 0.930 | min-side=896, EMA, closed-set head |
| Dilation8 | VGG-16 | CamVid-11 (test) | 0.653 | — | — | dilated CNN |
| BiSeNet | ResNet-18 | CamVid-11 (test) | 0.687 | — | — | real time baseline |
| VideoGCRF | ResNet-101 | CamVid-11 (test) | 0.752 | — | — | spatio-temporal CRF |
| RTFormer-Base | RTFormer | CamVid-11 (test) | 0.825 | — | — | larger variant |
| SegFormer-B5 | MiT-B5 | CamVid-11 (test) | 0.837 | — | — | transformer encoder–decoder |
NOTE PLEASE READ: This is NOT a pure segmentation model. Training uses no augmentation and we made zero architectural changes (adapter + BNHead only); backbones are frozen.
- GPU strongly recommended. DINOv3 + CLIP text on CPU is slow. If CUDA isn't available, we fall back to CPU.
- This adapter is trained from CLIP image features only; no text is seen during training. Labels far from your training image domain can still score high (because CLIP's text space is broad), but the ranking is usually sensible for in-domain images.
- On single images with a very long label list, softmax + low temperature often yields easier-to-read overlays.
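Why a low temperature helps: cosine scores tend to cluster in a narrow band, and dividing by a small temperature before the softmax spreads them into a clearly peaked distribution. The scores below are made up for illustration:

```python
import numpy as np

def softmax_with_temp(sims, temp):
    """Temperature-scaled softmax over prompt similarities."""
    z = sims / temp
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

sims = np.array([0.31, 0.28, 0.27])  # typical cosines: tightly clustered
p_flat = softmax_with_temp(sims, 1.0)    # nearly uniform, hard to read
p_sharp = softmax_with_temp(sims, 0.02)  # one clear winner
print(p_flat, p_sharp)
```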
- Why this isn't apples-to-apples with big seg models:
  - Backbones: we keep DINOv3 and CLIP frozen; most SOTA seg models train the whole backbone.
  - Head size / schedule: small BN head + ~30 epochs vs. heavy decoders and long schedules (100–160k iters).
  - Augmentations: none.
  - Objective: we also support promptable / open-vocab behavior (text fusion); classic seg models are closed-set only.
- 403/401 on DINOv3 downloads: your HF token needs the "public gated repositories" permission and you must accept DINOv3's terms on the model card. Set `HF_TOKEN` in the env.
- `model type dinov3_vit not recognized`: upgrade `transformers` and load with `trust_remote_code=True`.
- Slow training: confirm GPU utilization; raise `batch_size` until VRAM is used, increase workers/prefetch, and ensure TF32 is enabled.
- If you need `pydensecrf`: `pip install git+https://github.com/lucasb-eyer/pydensecrf.git@master`
- Dinov3-CLIP Adapter: Apache 2.0
- DINOv3 weights: released under the DINOv3 License by Meta; weights are gated and you generally should not redistribute them. Request/download them via Hugging Face or Meta's portal.
- OpenCLIP/CLIP: follow their respective licenses.
- Datasets used for training:
  - Cityscapes — under the Cityscapes license (non-commercial research; additional restrictions apply). You must obtain access and accept their terms.
  - COCO — annotations released under a permissive license; images are provided by their owners under a variety of Creative Commons licenses. Use requires compliance with COCO's terms.
- If you redistribute models trained on these datasets, ensure your use complies with those dataset licenses.
- [x] Segmentation head: add a light decoder over DINOv3 tokens and train with pseudo-labels
- [ ] Better calibration: learn a small temp or bias layer for more probability-like scores
- [ ] Prompt ensembling: expand / learn templates per class
- [ ] Train with more pictures for segmentation
- [ ] Train with a new architecture
Alignment between self-supervised vision backbones (like DINO/DINOv2) and language (CLIP) via learned mappings has appeared in the literature (e.g., Talking to DINO for open-vocabulary segmentation), which is conceptually similar to this adapter idea.
The adapter never sees text during training. It distills DINOv3 image features into CLIP's image space only. That makes it compact and simple to train. However, the absolute retrieval numbers trail end-to-end CLIP/SigLIP. For highlighting semantics (retrieval, quick tagging, and prompt-filtered segmentation) it is practical and easy to deploy.

