Matic Fučka1 # Vitjan Zavrtanik1,2 # Danijel Skočaj1
1 University of Ljubljana, 2 *codeplain.
# Equal contribution
Overview | Get Started | Results | Citation
- Apr. 9th, 2026: The models and training code for AnomalyVFM have been organized and uploaded.
- Feb. 20th, 2026: AnomalyVFM has been accepted to CVPR 2026 🔥🔥🔥
- Release the base code, pretrained weights and dataset
- Upload dataset to HuggingFace
- Upload weights to HuggingFace
- Release the 3D Zero-shot Extension - Code & Paper (Goal: ~July 2026)
- Release the Full-shot Extension - Code & Paper (Goal: ~November 2026)
- **Synthetic Dataset Generation**: Anomaly-free images are created using an image generation model and then modified via inpainting to produce anomalous versions within a targeted region. Corresponding masks are generated by comparing feature-level differences between the normal and anomalous images, which also serves to filter out samples where the defect failed to generate.
- **AnomalyVFM**: AnomalyVFM adapts a pretrained backbone by injecting LoRA-based feature adaptation modules into the transformer attention layers, refining the internal representations for anomaly detection. It uses a convolutional decoder and a confidence-weighted loss to generate segmentation masks, a combination specifically designed to remain robust against noise in synthetic training labels.
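The exact mask-derivation procedure is in the code; as an illustration only, a feature-level difference between the normal and inpainted image can be thresholded into a pseudo-mask, and samples whose difference never exceeds the threshold can be filtered out (all names and the threshold below are hypothetical, not taken from the repository):

```python
import numpy as np

def pseudo_mask(feat_normal, feat_anom, thresh=0.5):
    """Per-patch cosine distance between normal and anomalous features.

    feat_*: (H, W, C) patch features from the same frozen backbone.
    Returns a binary mask and a flag saying whether any defect was found,
    used to discard samples where inpainting failed to produce an anomaly.
    """
    a = feat_normal / np.linalg.norm(feat_normal, axis=-1, keepdims=True)
    b = feat_anom / np.linalg.norm(feat_anom, axis=-1, keepdims=True)
    dist = 1.0 - (a * b).sum(-1)      # (H, W) cosine distance per patch
    mask = dist > thresh
    return mask, bool(mask.any())

rng = np.random.default_rng(0)
normal = rng.standard_normal((4, 4, 8))
anom = normal.copy()
anom[1, 1] = -anom[1, 1]              # flip one patch: cosine distance 2
mask, keep = pseudo_mask(normal, anom)
```

Unchanged patches have zero cosine distance, so only the genuinely modified region survives the threshold.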
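For intuition, the LoRA idea behind the adaptation modules can be sketched as a frozen weight plus a trainable low-rank update; this is a generic illustration, not the repository's actual adapter classes or the exact places they hook into the attention layers:

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a low-rank update (alpha/r) * B @ A.

    Minimal sketch of LoRA only; the repository's modules may differ.
    """
    def __init__(self, W, rank=4, alpha=4, rng=None):
        rng = rng or np.random.default_rng(0)
        out_dim, in_dim = W.shape
        self.W = W                                            # frozen
        self.A = rng.standard_normal((rank, in_dim)) * 0.01   # trainable
        self.B = np.zeros((out_dim, rank))                    # trainable, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.eye(3)
layer = LoRALinear(W)
x = np.ones(3)
y = layer(x)   # B is zero-initialized, so the output equals the frozen W @ x
```

Zero-initializing `B` means the adapted network starts out identical to the pretrained backbone, and only the small `A`/`B` matrices need to be trained.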
Note
Some small details differ from the paper, as we found a more stable hyperparameter configuration for both synthetic dataset generation and model training. Additionally, this setup currently works better for the planned future extensions and when applied to new datasets outside the original paper. We advise using the new configuration, as it trains faster and more stably. The changes will be presented in a follow-up paper.
Step 1: Clone the repository:
Clone this repository and navigate to the project directory:
```bash
git clone https://github.com/MaticFuc/AnomalyVFM.git
cd AnomalyVFM
```
Step 2: Environment Setup:
It is recommended to set up a conda environment and install dependencies via pip. Use the following commands to set up your environment:
Create and activate a new conda environment:
```bash
conda create -n anomalyvfm python=3.10 pip
conda activate anomalyvfm
```
Install dependencies:
```bash
cd flux && pip install -e .
cd ../flux2 && pip install -e .
cd .. && pip install -r requirements.txt
```
The current setup uses FLUX.1-krea [dev] and DINOv3 ViT-L/16. A dataset generated with FLUX.2 will be released in the future.
Before running, also download the foreground segmentor and move it to pretrained_models. The DINOv3 repository should also be cloned (add its path in models/dinov3.py), and the DINOv3 ViT-L weights should be placed in pretrained_models. The other backbones work out of the box.
```bash
python generate_dataset.py --mode generate --n-img 10000
python generate_dataset.py --mode generate_anom --n-img 10000
python generate_dataset.py --mode filter --n-img 10000
```
Optional arguments:
- `--img-gen-model`: Image generator to use (`flux`, `flux2`, `qwen_image`, `zimage`). Default: `flux`.
- `--object-data`: Object data used to generate the images. Only `default` for now.
- `--image-size`: Resolution of the generated images. Tested only at 1024x1024.
- `--n-img`: Number of images before dataset filtering. Default: 1.
- `--out-path`: Output path for the generated images.
- `--seed`: Seed for generation.
- `--filter-model`: Pretrained backbone used to derive the final mask.
- `--mode`: Operational mode (`generate`, `generate_anom`, `filter`).
After this you should get results similar to the image below.
Alternatively, you can download the dataset with the following command (requires HuggingFace authentication; the whole download takes ~20 min):
```bash
python download_dataset.py
```
To train a model, run:
```bash
python train.py --no-eval
```
Optional arguments:
- `--model`: Backbone to adapt (possible choices: `radio`, `dinov3`, `dinov2`, `clip`, `siglip2`).
- `--peft-type`: PEFT adapters to use (possible choices: `lora`, `dora`).
- `--peft-rank`: Rank for LoRA or DoRA.
- `--data-path`: Path to the training set. It should follow the same layout as our generated one.
- `--image-size`: Image size for training. RADIO, DINOv3, and SigLIP2 use 768, while DINOv2 and CLIP use 672; with patch sizes 16 and 14 this yields the same feature-map size (768/16 = 672/14 = 48 patches per side).
- `--out-path`: Path to store the models and the results.
- `--seed`: Seed for training.
- `--batch-size`: Total batch size. It should be divisible by the number of accumulation steps.
- `--accumulation-steps`: Number of gradient accumulation steps.
- `--optimizer`: Optimizer for the model (possible choices: `adamw`, `adam`, `muon`). Only tested with `adamw`.
- `--learning-rate`: Learning rate.
- `--weight-decay`: Weight decay for the optimizer (it should be 0 for `muon`).
- `--scheduler`: Scheduler for training (possible choices: `none`, `cos`, `multisteplr`, `exp`). Default: `none`.
- `--train-steps`: Number of training iterations. Default: 200.
- `--test-steps`: How often the model is tested. Only the final model is saved. Default: 100.
- `--no-evaluate`: Disable evaluation during and after training.
- `--test-datasets`: Industrial datasets to test on.
- `--medical-test-datasets-img`: Medical datasets with only image-level labels to test on.
- `--medical-test-datasets-pix`: Medical datasets with only pixel-level labels to test on.
- `--mean-kernel-size`: Size of the mean kernel used to smooth the final prediction.
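The interaction between `--batch-size` and `--accumulation-steps` can be illustrated with a minimal, dependency-free sketch (not the repository's training loop): the total batch is split into micro-batches whose gradients are summed before a single optimizer update, which is why the batch size must be divisible by the number of accumulation steps.

```python
def grad(w, xs, ys):
    """d/dw of the summed squared error over one micro-batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

def accumulated_grad(w, xs, ys, accumulation_steps):
    micro = len(xs) // accumulation_steps      # micro-batch size
    g = 0.0
    for i in range(accumulation_steps):        # accumulate; no update in between
        s = slice(i * micro, (i + 1) * micro)
        g += grad(w, xs[s], ys[s])
    return g

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.5
# Accumulating over 2 micro-batches matches the full-batch gradient.
g_accum = accumulated_grad(w, xs, ys, accumulation_steps=2)
g_full = grad(w, xs, ys)
```

Because the summed gradients equal the full-batch gradient, accumulation trades memory for extra forward/backward passes without changing the effective batch size.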
Step 1: Data Setup:
Download the datasets below (only the ones you wish to evaluate on):
- Industrial Domain (original paper): MVTec AD, VisA, Real-IAD, MPDD, BTAD, KSDD, KSDD2, DAGM, DTD-Synthetic
- Medical Domain (pixel level): ISIC, CVC-ColonDB, CVC-ClinicDB, Kvasir, Endo, TN3K
- Industrial Domain (beyond the original paper): Real-IAD Variety, GoodsAD, RSDD
- Industrial Domain, 3D (beyond the original paper): MVTec 3D, Eyecandies, Real-IAD D3
After downloading the data, change the dataset paths set in datasets/dataset.py.
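The variable names in datasets/dataset.py may differ, but a path configuration of this kind typically looks like the hypothetical sketch below; edit the actual file rather than copying this.

```python
# Hypothetical sketch only -- the real names in datasets/dataset.py may differ.
DATASET_PATHS = {
    "mvtec": "/data/mvtec_ad",
    "visa": "/data/visa",
    "btad": "/data/btad",
}

def get_dataset_path(name):
    """Look up the local root directory for a dataset by name."""
    return DATASET_PATHS[name.lower()]
```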
Step 2: Run Evaluation:
```bash
python test.py --model-path /path/to/folder/with/model.pkl -d <dataset_1> <dataset_2> ... <dataset_n>
```
Optional arguments:
- `--model`: Backbone to adapt (possible choices: `radio`, `dinov3`, `dinov2`, `clip`, `siglip2`).
- `--model-path`: Path to the folder with the pretrained model.
- `--peft-type`: PEFT adapters to use (possible choices: `lora`, `dora`).
- `--peft-rank`: Rank for LoRA or DoRA.
- `--image-size`: Image size for testing.
- `--save-images`: Flag to save images.
- `--no-logging`: Flag to disable logging.
- `--out-path`: Path to store the models and the results.
- `--mean-kernel-size`: Size of the mean kernel used to smooth the final prediction.
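What a `--mean-kernel-size` style option does to a raw per-pixel anomaly map can be sketched as a simple box (mean) filter; this is an illustration of the smoothing idea, not the repository's exact implementation.

```python
import numpy as np

def mean_smooth(score_map, k):
    """k x k box (mean) filter with edge padding over a 2-D anomaly map."""
    pad = k // 2
    padded = np.pad(score_map, pad, mode="edge")
    out = np.empty_like(score_map, dtype=float)
    h, w = score_map.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

raw = np.zeros((5, 5))
raw[2, 2] = 9.0                   # a single noisy spike
smooth = mean_smooth(raw, k=3)    # spike spread over its 3x3 neighborhood
```

Averaging suppresses isolated single-pixel responses while preserving larger connected anomalous regions, which generally improves pixel-level metrics.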
Average results over the datasets used in the paper can be seen in the tables below. We also offer several new backbones. The results differ from the paper because the new setup works better for future extensions. Detailed results are in the results folder. We also added results for Real-IAD Variety, Real-IAD D3, Eyecandies, MVTec 3D, GoodsAD and RSDD.
Pretrained models can be downloaded with:
```bash
bash download_checkpoints.sh
```
Industrial datasets (MVTec AD, VisA, Real-IAD, MPDD, BTAD, KSDD, KSDD2, DTD, DAGM)
| Model | Download Link | I-AUROC | I-F1 | I-AP | P-AUROC | P-F1 | P-AP | AUPRO |
|---|---|---|---|---|---|---|---|---|
| RADIO | Download | 94.5 | 89.2 | 93.2 | 96.3 | 44.8 | 44.1 | 89.3 |
| DINOv2 | Download | 90.1 | 84.0 | 86.1 | 95.4 | 42.4 | 41.4 | 85.7 |
| DINOv3 | Download | 91.3 | 84.5 | 87.3 | 95.8 | 43.9 | 43.5 | 86.9 |
| CLIP | Download | 90.3 | 85.6 | 89.3 | 94.6 | 41.3 | 38.8 | 86.9 |
| SigLIP2 | Download | 91.1 | 84.3 | 85.1 | 95.7 | 43.0 | 41.9 | 86.4 |
Medical datasets - Image level (Head CT, BrainMRI, BR35H)
| Model | I-AUROC | I-F1 | I-AP |
|---|---|---|---|
| RADIO | 94.2 | 89.8 | 93.5 |
| DINOv2 | 88.6 | 84.6 | 89.9 |
| DINOv3 | 85.6 | 80.9 | 86.4 |
| CLIP | 90.2 | 86.8 | 90.8 |
| SigLIP2 | 92.5 | 88.6 | 91.4 |
Medical datasets - Pixel level (ISIC, ClinicDB, ColonDB, Kvasir, Endo, TN3K)
| Model | P-AUROC | P-F1 | P-AP | AUPRO |
|---|---|---|---|---|
| RADIO | 88.8 | 59.8 | 59.1 | 82.3 |
| DINOv2 | 88.7 | 60.8 | 63.0 | 80.9 |
| DINOv3 | 88.2 | 58.9 | 60.0 | 80.7 |
| CLIP | 80.9 | 49.0 | 47.2 | 72.3 |
| SigLIP2 | 88.4 | 58.5 | 58.0 | 81.6 |
The following commands were used to produce the results above, using the downloadable dataset:
```bash
# RADIO
python train.py --model radio --image-size 768 --train-steps 200 --seed 12
# DINOv3
python train.py --model dinov3 --image-size 768 --train-steps 500 --seed 12
# SigLIP2
python train.py --model siglip2 --image-size 768 --train-steps 300 --seed 12
# DINOv2
python train.py --model dinov2 --image-size 672 --train-steps 500 --seed 12
# CLIP
python train.py --model clip --image-size 672 --train-steps 500 --seed 12
```
To test the model on a single image, use the following command:
```bash
python predict_single_image.py --image-path /path/to/image.png --model-path /path/to/folder/with/model.pkl
```
By default, the prediction is saved to pred.png.
If this code or dataset contributes to your research, please consider citing our paper and giving this repo a ⭐️ :)
```bibtex
@InProceedings{fucka2026anomaly_vfm,
    title     = {AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors},
    author    = {Fučka, Matic and Zavrtanik, Vitjan and Skočaj, Danijel},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026}
}
```
This project is based on FLUX, RADIO, and the DINO family of models. Thanks for their excellent work. Thanks also to the curators of all the datasets used for evaluation.
For any questions, please feel free to contact us. Also feel free to suggest any possible simplifications to the code.


