A Model Inversion Attack Framework for High-Fidelity Training Data Extraction from CLIP Models
This repository contains the official implementation of LeakyCLIP, a novel model inversion attack framework designed to reconstruct training images from CLIP text embeddings. Our method achieves a 258% improvement in SSIM over baseline approaches on a LAION-2B subset.
LeakyCLIP: Extracting Training Data from CLIP
Yunhao Chen, Shujie Wang, Xin Wang, Xingjun Ma (Fudan University)
arXiv:2508.00756v3 [cs.CR]
LeakyCLIP addresses three fundamental challenges in CLIP inversion:
- **Non-Robust Features**: CLIP learns features that are highly predictive but may not correspond to meaningful visual concepts, leading to unstable optimization landscapes.
  - Solution: Adversarial Fine-Tuning (AFT) using FARE [39] to smooth gradients
- **Limited Visual Semantics**: Text embeddings capture abstract concepts but lack high-level visual information (object layout, scale).
  - Solution: Linear Transformation-Based Embedding Alignment (EA) to project text embeddings into pseudo-image embeddings
- **Lack of Low-Level Features**: Pseudo-image embeddings lack fine-grained details for realistic reconstruction.
  - Solution: Controlled Stable Diffusion-Based Refinement (DR) to add textures and sharp edges
Figure 1: The LeakyCLIP three-stage pipeline: (1) Adversarial Fine-Tuning (AFT) smooths the optimization landscape, (2) Embedding Alignment (EA) projects text embeddings into pseudo-image embeddings via learned linear transformation M, (3) Diffusion Refinement (DR) adds low-level details using Stable Diffusion.
VAE Latent Space Inversion (Improved): This implementation includes an enhanced inversion mode that optimizes directly in the VAE latent space (instead of pixel space), resulting in better fidelity, faster convergence, and more realistic reconstructions. Enable with `--inversion-space vae`.
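The intuition behind latent-space inversion can be shown with a toy example: the optimization variable lives in a low-dimensional latent space, and gradients flow back through a frozen decoder. This numpy sketch substitutes a random linear map for the SD-VAE decoder and a least-squares target for the CLIP embedding loss; it is illustrative only, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen linear "decoder": maps a 16-dim latent to a 64-dim signal
# (a stand-in for the SD-VAE decoder; the real loss matches CLIP embeddings).
D = rng.normal(size=(64, 16))
target = rng.normal(size=64)

z = np.zeros(16)                    # optimization variable lives in latent space
lr = 0.002
losses = []
for _ in range(500):
    residual = D @ z - target
    losses.append(float(residual @ residual))
    grad = 2.0 * D.T @ residual     # gradient w.r.t. z flows through the decoder
    z -= lr * grad

print(losses[0] > losses[-1])       # True: optimizing the latent reduces the loss
```

Because the latent has far fewer dimensions than the output, the search space is smaller and better conditioned, which is the motivation for the improved convergence noted above.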
- Three-Stage Pipeline: Adversarial Fine-Tuning → Embedding Alignment → Diffusion Refinement
- Multi-Architecture Support: ViT-B/16, ViT-B/32, ViT-L/14, ConvNeXt-Base
- VAE Latent Space Inversion: Optimize in SD-VAE latent space for improved efficiency and fidelity
- Comprehensive Metrics: SSIM, LPIPS, CLIP Score, SSCD
- Membership Inference: Detect training data membership from reconstruction metrics
- Privacy Risk Assessment: Extract sensitive PII including facial images
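As a rough illustration of metric-based membership inference (the scores and threshold below are made up; the repository's actual procedure may combine several reconstruction metrics):

```python
import numpy as np

def infer_membership(ssim_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Predict 'member' when reconstruction quality exceeds a threshold."""
    return ssim_scores > threshold

# Toy scores: training-set members tend to reconstruct better than non-members.
member_ssim = np.array([0.62, 0.71, 0.55, 0.68])
nonmember_ssim = np.array([0.31, 0.44, 0.28, 0.39])

preds_m = infer_membership(member_ssim, threshold=0.5)
preds_n = infer_membership(nonmember_ssim, threshold=0.5)
print(preds_m.all(), (~preds_n).all())  # True True on these toy scores
```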
```bash
git clone https://github.com/dongdongunique/LeakyCLIP.git
cd LeakyCLIP
pip install -r requirements.txt
```

LeakyCLIP requires pre-trained CLIP models. Set up the model directory:
```bash
mkdir -p ./models
```
⚠️ Important: Before running experiments, modify `paths.py` to match your system paths:

```python
# In leakyclip_release/paths.py
DEFAULT_MODEL_ROOT = os.environ.get(
    "LEAKYCLIP_MODEL_ROOT",
    "/your/path/to/models",  # <-- Update this
)
DEFAULT_DATA_ROOT = os.environ.get(
    "LEAKYCLIP_DATA_ROOT",
    "/your/path/to/data",  # <-- Update this
)
```
Required Models:
| Model | Architecture | HuggingFace ID | Purpose |
|---|---|---|---|
| ViT-B/16 | ViT-B-16 | laion/CLIP-ViT-B-16-laion2B-s34B-b88K | Main inversion model |
| ViT-B/32 | ViT-B-32 | laion/CLIP-ViT-B-32-laion2B-s34B-b79K | Alternative architecture |
| ViT-L/14 | ViT-L-14 | laion/CLIP-ViT-L-14-laion2B-s32B-b82K | Large model variant |
| ConvNeXt-Base | ConvNeXt | laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K | CLIP Score metric |
| Robust ViT-B/16 | ViT-B-16 | chs20/FARE4-ViT-B-16-laion2B-s34B-b88K | FARE adversarial fine-tuned |
| Robust ViT-B/32 | ViT-B-32 | chs20/FARE4-ViT-B-32-laion2B-s34B-b79K | FARE adversarial fine-tuned |
| Robust ViT-L/14 | ViT-L-14 | Erdos2568/Robust_CLIP (eps8/20000.pt) | Robust CLIP (eps=8) |
Download via HuggingFace:
```bash
# ViT-B/16 (standard)
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-16', pretrained='laion2b_s34b_b88k')"

# Robust ViT-B/16 (FARE fine-tuned)
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-16', pretrained='chs20/FARE4-ViT-B-16-laion2B-s34B-b88K')"
```

Or download manually:
```bash
# Using huggingface-cli
huggingface-cli download laion/CLIP-ViT-B-16-laion2B-s34B-b88K --local-dir ./models/vit-b-16
huggingface-cli download chs20/FARE4-ViT-B-16-laion2B-s34B-b88K --local-dir ./models/vit-b-16-robust
huggingface-cli download laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K --local-dir ./models/convnext-base
```

Additional Robust Models:
| Model | HuggingFace URL | Description |
|---|---|---|
| Robust ViT-B/32 | chs20/FARE4-ViT-B-32-laion2B-s34B-b79K | FARE adversarial fine-tuned ViT-B/32 |
| Robust ViT-L/14 | Erdos2568/Robust_CLIP | Robust CLIP ViT-L/14 (eps=8) |
Download Robust Models:
```bash
# Robust ViT-B/32 (FARE fine-tuned)
huggingface-cli download chs20/FARE4-ViT-B-32-laion2B-s34B-b79K --local-dir ./models/vit-b-32-robust

# Robust ViT-L/14 (eps=8)
wget https://huggingface.co/Erdos2568/Robust_CLIP/resolve/main/eps8/20000.pt \
  -O ./models/vit-l-14-robust.pt
```

Stable Diffusion (for refinement):

```bash
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 --local-dir ./models/sdxl-base-1.0
huggingface-cli download madebyollin/sdxl-vae-fp16-fix --local-dir ./models/sdxl-vae
```

SSCD (Self-Supervised Copy Detection):
```bash
# Download SSCD weights
wget https://dl.fbaipublicfiles.com/sscd-copy-detection/sscd_disc_mixup.torchvision.pt \
  -O ./models/sscd_disc_mixup.torchvision.pt
```

LAION-HD Subset (Recommended):
```bash
# Download curated high-quality subset (~1M samples)
python -c "
from datasets import load_dataset
dataset = load_dataset('yuvalkirstain/laion-hd-subset', split='train')
# Save to disk...
dataset.save_to_disk('./HF_dataset/laion')
"
```

Furniture Object Dataset:
```bash
# Download furniture object dataset (~10K samples)
python -c "
from datasets import load_dataset
dataset = load_dataset('abrarlohia/sample_furniture_object', split='train')
# Save to disk...
dataset.save_to_disk('./HF_dataset/furniture_object')
"
```

Supported Datasets:

- `laion` - LAION-2B subset (main evaluation)
- `flickr` - Flickr30k (caption diversity)
- `furniture_object` - Furniture objects (structured content)
- `lfw` - Labeled Faces in the Wild (privacy evaluation)
HuggingFace Dataset Sources:
| Dataset | HuggingFace ID | Description |
|---|---|---|
| LAION-HD Subset | yuvalkirstain/laion-hd-subset | Curated high-quality subset (recommended) |
| Furniture Objects | abrarlohia/sample_furniture_object | Structured furniture images |
Note: We recommend using `yuvalkirstain/laion-hd-subset` instead of the full LAION-2B-en dataset (~2.3 billion images) for faster experimentation.
If you want to train your own embedding alignment matrix:
```bash
python -m leakyclip_release.ea_train \
  --model-name ViT-B-16_robust_fair_4 \
  --dataset laion \
  --num-samples 2000 \
  --batch-size 256 \
  --output-dir ./text2image_embedding
```

This learns the linear transformation matrix M that maps text embeddings to pseudo-image embeddings.
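Conceptually, EA fits a linear regression from text embeddings to image embeddings. A minimal numpy sketch on synthetic data, using the closed-form least-squares solution (the actual training script may optimize M differently, e.g. with mini-batch gradient descent on normalized embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8       # toy embedding dimension (CLIP embeddings are e.g. 512-dim)
n = 200     # number of caption/image pairs

# Synthetic paired embeddings: image embeddings are a hidden linear map of text ones.
T = rng.normal(size=(n, d))                        # text embeddings, one per row
M_true = rng.normal(size=(d, d))
img = T @ M_true + 0.01 * rng.normal(size=(n, d))  # slightly noisy image embeddings

# Fit M by least squares: argmin_M ||T M - img||_F^2
M, *_ = np.linalg.lstsq(T, img, rcond=None)

pseudo = T @ M                                     # pseudo-image embeddings
err = np.linalg.norm(pseudo - img) / np.linalg.norm(img)
print(err < 0.05)  # True: alignment recovers the map up to the noise level
```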
```bash
python -m leakyclip_release.main \
  --config ./leakyclip_release/configs/method/single.json \
  --text "a mid-century modern beige armchair with wooden tapered legs" \
  --output ./out/inversion.png \
  --compute-metrics
```

Example Output:
Figure: Example reconstruction from text prompt using LeakyCLIP with VAE latent space inversion and Stable Diffusion refinement.
```bash
python -m leakyclip_release.main \
  --config ./leakyclip_release/configs/method/laion.json \
  --dataset-name laion \
  --output-dir ./out/laion_results \
  --max-samples 1000 \
  --compute-metrics
```

LeakyCLIP provides a Python API for programmatic access:
```python
from leakyclip_release.models import create_model
from leakyclip_release.inversion import CLIPInverter, InversionConfig
from leakyclip_release.refinement import SDRefiner, SDRefineConfig
import torch

# Setup device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP model
model, tokenizer, _, _ = create_model("ViT-B-16", device=torch.device(device))

# Create inverter with custom config
inverter = CLIPInverter(
    model,
    tokenizer,
    device=device,
    config=InversionConfig(
        num_steps=500,
        lr=0.03,
        num_views=32,
        image_size=1024,
        inversion_space="vae",  # VAE latent space for better fidelity
        vae_model_id="./models/sdxl-vae",
    ),
)

# Optional: Add Stable Diffusion refinement
refiner = SDRefiner(
    SDRefineConfig(
        model_id="./models/sdxl-base-1.0",
        strength=0.3,
        num_inference_steps=50,
    )
)

# Perform inversion from text
prompt = (
    "A sleek contemporary living room with a gray sectional sofa, "
    "glass coffee table, floor lamp with warm lighting, hardwood floor, "
    "large window with curtains, modern interior design, "
    "architectural digest style, high resolution"
)
result = inverter.invert_from_text(prompt, return_metrics=False)

# Apply refinement (optional)
refined_result = refiner.refine(result, prompt)

# Save result
refined_result.save("./out/inversion_sdk.png")
```

Inversion with metric evaluation:

```python
from PIL import Image
from leakyclip_release.eval import LPIPSMetric, SSIMMetric

# Load reference image
reference = Image.open("reference.png").convert("RGB")

# Build metrics
metrics = [
    LPIPSMetric(device=device),
    SSIMMetric(device=device),
]

# Invert with metrics evaluation
result, metric_values = inverter.invert_from_text(
    "a mid-century modern beige armchair",
    reference_image=reference,
    metrics=metrics,
    return_metrics=True,
)

print(f"LPIPS: {metric_values.get('lpips')}")
print(f"SSIM: {metric_values.get('ssim')}")
```

VAE latent space inversion:

```python
from leakyclip_release.inversion import InversionConfig

# Configure for VAE latent space
vae_config = InversionConfig(
    image_size=1024,
    inversion_space="vae",  # Use VAE latent space
    vae_model_id="./models/sdxl-vae",
    vae_dtype="fp16",
    vae_scaling_factor=0.13025,
    latent_l2_weight=0.001,  # L2 regularization in latent space
)

inverter = CLIPInverter(model, tokenizer, device=device, config=vae_config)
prompt = (
    "A sleek contemporary living room with a gray sectional sofa, "
    "glass coffee table, floor lamp with warm lighting, hardwood floor, "
    "large window with curtains, modern interior design, "
    "architectural digest style, high resolution"
)
result = inverter.invert_from_text(prompt)
```

Using a robust (adversarially fine-tuned) model:

```python
# Load adversarially fine-tuned (robust) model
robust_model, _, _, _ = create_model("ViT-B-16_robust_fair_4", device=torch.device(device))

# Use with embedding alignment
inverter = CLIPInverter(
    robust_model,
    tokenizer,
    device=device,
    config=InversionConfig(
        align_model_name="ViT-B-16_robust_fair_4",
        transpose_model_name="pinverse_model",
    ),
)
```

Project structure:

```
leakyclip_release/
├── main.py                  # Main entry point for inversion
├── ea_train.py              # Embedding alignment training
├── config.py                # Configuration management
├── data.py                  # Dataset loaders
├── models/
│   └── model_factory.py     # CLIP model factory (ViT, ConvNeXt)
├── inversion/
│   ├── inverter.py          # CLIP inversion logic
│   └── augmentations.py     # Data augmentation pipeline
├── eval/
│   ├── base.py              # Metric base classes
│   └── metrics.py           # SSIM, LPIPS, CS, SSCD metrics
├── refinement/
│   └── sd_refiner.py        # Stable Diffusion refinement
└── configs/                 # Configuration files
    ├── inversion/default.json
    ├── method/single.json
    ├── method/laion.json
    └── method/flickr.json
```
| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 500 | Inversion optimization steps |
| `lr` | 0.03 | Learning rate for inversion |
| `num_views` | 32 | Number of augmented views |
| `image_size` | 1024 | Output image size |
| `tv_weight` | 0.05 | Total variation regularization |
| `patch_loss_weight` | 5.0 | Patch-based consistency loss |
| `thresholding` | 0.95 | Dynamic threshold quantile |
| `refine_strength` | 0.3-0.55 | SD img2img strength |
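For reference, `tv_weight` scales a total variation penalty that discourages high-frequency noise in the reconstruction. A minimal numpy sketch of the standard anisotropic TV term (the repository's exact formulation may differ):

```python
import numpy as np

def total_variation(img: np.ndarray) -> float:
    """Anisotropic total variation of an H x W image:
    sum of absolute differences between neighboring pixels."""
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbors
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbors
    return float(dh + dw)

flat = np.ones((8, 8))
noisy = flat + np.random.default_rng(0).normal(0, 0.1, (8, 8))
print(total_variation(flat))             # 0.0: a constant image has zero TV
tv_term = 0.05 * total_variation(noisy)  # weighted as tv_weight * TV in the loss
```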
| Variable | Description | Default |
|---|---|---|
| `LEAKYCLIP_MODEL_ROOT` | Model checkpoint directory | ./models |
| `LEAKYCLIP_DATA_ROOT` | Dataset cache directory | ./HF_dataset |
| `LEAKYCLIP_ALIGN_ROOT` | Embedding alignment weights | ./text2image_embedding |
| `LEAKYCLIP_SSCD_WEIGHTS` | SSCD model weights | ./models/sscd_*.pt |
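These can be exported in the shell before launching any script; the paths below are illustrative, not defaults:

```shell
export LEAKYCLIP_MODEL_ROOT=/data/leakyclip/models
export LEAKYCLIP_DATA_ROOT=/data/leakyclip/HF_dataset
export LEAKYCLIP_ALIGN_ROOT=/data/leakyclip/text2image_embedding
export LEAKYCLIP_SSCD_WEIGHTS=/data/leakyclip/models/sscd_disc_mixup.torchvision.pt
```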
Learn the linear transformation matrix M for embedding alignment:
```bash
python -m leakyclip_release.ea_train \
  --model-name ViT-B-16_robust_fair_4 \
  --dataset laion \
  --batch-size 256 \
  --output-dir ./text2image_embedding
```

Supported datasets: `laion`, `flickr`, `furniture_object`
We adopt four complementary metrics for comprehensive evaluation:
- SSIM [48]: Structural similarity [-1, 1], higher is better
- LPIPS [57]: Perceptual similarity [0, ∞), lower is better
- CLIP Score (CS): Cosine similarity using ConvNeXt-Base [-1, 1]
- SSCD [36]: Self-supervised copy detection [-1, 1]
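CLIP Score is the cosine similarity between an image embedding and a text embedding. A minimal numpy sketch with toy vectors (in the repository the embeddings come from ConvNeXt-Base):

```python
import numpy as np

def clip_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    img_emb = img_emb / np.linalg.norm(img_emb)
    txt_emb = txt_emb / np.linalg.norm(txt_emb)
    return float(np.dot(img_emb, txt_emb))

a = np.array([1.0, 0.0, 1.0])
print(clip_score(a, a))                          # 1.0: identical directions
print(clip_score(a, np.array([0.0, 1.0, 0.0])))  # 0.0: orthogonal directions
```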
Attacker Capabilities:
- White-box access to CLIP parameters
- Access to exact training captions paired with target images
- Standard assumptions for rigorous privacy vulnerability assessment
Attack Goal: Reconstruct training images from text prompts via model inversion.
If you use this code in your research, please cite:
```bibtex
@article{chen2025leakyclip,
  title={LeakyCLIP: Extracting Training Data from CLIP},
  author={Chen, Yunhao and Wang, Shujie and Wang, Xin and Ma, Xingjun},
  journal={arXiv preprint arXiv:2508.00756},
  year={2025}
}
```

Key papers and methods used in this work:
- [37] Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
- [39] Wang et al. "Fine-tuning for Adversarially Robust Embeddings" (FARE)
- [38] Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)
- [48] Wang et al. "Image Quality Assessment: From Error Visibility to Structural Similarity" (SSIM)
- [57] Zhang et al. "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" (LPIPS)
- [36] Pizzi et al. "A Self-Supervised Descriptor for Image Copy Detection" (SSCD)
This code is provided for research purposes only to understand and mitigate privacy risks in multimodal models. The authors are not responsible for any misuse of this code for unauthorized data extraction or privacy violations.
For questions or issues, please contact:
- Xingjun Ma (xingjunma@fudan.edu.cn)
- Open an issue on GitHub
MIT License

