A Model Inversion Attack Framework for High-Fidelity Training Data Extraction from CLIP Models
This repository contains the official implementation of LeakyCLIP, a novel model inversion attack framework designed to reconstruct training images from CLIP text embeddings. Our method achieves a 258% improvement in SSIM over baseline approaches on a LAION-2B subset.
LeakyCLIP: Extracting Training Data from CLIP
Yunhao Chen, Shujie Wang, Xin Wang, Xingjun Ma (Fudan University)
arXiv:2508.00756v3 [cs.CR]
LeakyCLIP addresses three fundamental challenges in CLIP inversion:
- **Non-Robust Features**: CLIP learns features that are highly predictive but may not correspond to meaningful visual concepts, leading to unstable optimization landscapes.
  - Solution: Adversarial Fine-Tuning (AFT) using FARE [39] to smooth gradients
- **Limited Visual Semantics**: Text embeddings capture abstract concepts but lack high-level visual information (object layout, scale).
  - Solution: Linear Transformation-Based Embedding Alignment (EA) to project text embeddings into pseudo-image embeddings
- **Lack of Low-Level Features**: Pseudo-image embeddings lack fine-grained details for realistic reconstruction.
  - Solution: Controlled Stable Diffusion-Based Refinement (DR) to add textures and sharp edges
Figure 1: The LeakyCLIP three-stage pipeline: (1) Adversarial Fine-Tuning (AFT) smooths the optimization landscape, (2) Embedding Alignment (EA) projects text embeddings into pseudo-image embeddings via learned linear transformation M, (3) Diffusion Refinement (DR) adds low-level details using Stable Diffusion.
VAE Latent Space Inversion (Improved): This implementation includes an enhanced inversion mode that optimizes directly in the VAE latent space (instead of pixel space), resulting in better fidelity, faster convergence, and more realistic reconstructions. Enable with `--inversion-space vae`.
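The intuition behind latent-space inversion can be shown with a toy example: the optimization variable lives in a low-dimensional latent space, and gradients flow back through a frozen decoder. This numpy sketch substitutes a random linear map for the SD-VAE decoder and a least-squares target for the CLIP embedding loss; it is illustrative only, not the repository's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen linear "decoder": maps a 16-dim latent to a 64-dim signal
# (a stand-in for the SD-VAE decoder; the real loss matches CLIP embeddings).
D = rng.normal(size=(64, 16))
target = rng.normal(size=64)

z = np.zeros(16)                    # optimization variable lives in latent space
lr = 0.002
losses = []
for _ in range(500):
    residual = D @ z - target
    losses.append(float(residual @ residual))
    grad = 2.0 * D.T @ residual     # gradient w.r.t. z flows through the decoder
    z -= lr * grad

print(losses[0] > losses[-1])       # True: optimizing the latent reduces the loss
```

Because the latent has far fewer dimensions than the output, the search space is smaller and better conditioned, which is the motivation for the improved convergence noted above.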
- Three-Stage Pipeline: Adversarial Fine-Tuning → Embedding Alignment → Diffusion Refinement
- Multi-Architecture Support: ViT-B/16, ViT-B/32, ViT-L/14, ConvNeXt-Base
- VAE Latent Space Inversion: Optimize in SD-VAE latent space for improved efficiency and fidelity
- Comprehensive Metrics: SSIM, LPIPS, CLIP Score, SSCD
- Membership Inference: Detect training data membership from reconstruction metrics
- Privacy Risk Assessment: Extract sensitive PII including facial images
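As a rough illustration of metric-based membership inference (the scores and threshold below are made up; the repository's actual procedure may combine several reconstruction metrics):

```python
import numpy as np

def infer_membership(ssim_scores: np.ndarray, threshold: float) -> np.ndarray:
    """Predict 'member' when reconstruction quality exceeds a threshold."""
    return ssim_scores > threshold

# Toy scores: training-set members tend to reconstruct better than non-members.
member_ssim = np.array([0.62, 0.71, 0.55, 0.68])
nonmember_ssim = np.array([0.31, 0.44, 0.28, 0.39])

preds_m = infer_membership(member_ssim, threshold=0.5)
preds_n = infer_membership(nonmember_ssim, threshold=0.5)
print(preds_m.all(), (~preds_n).all())  # True True on these toy scores
```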
```bash
git clone https://github.com/dongdongunique/LeakyCLIP.git
cd LeakyCLIP
pip install -r requirements.txt
```

LeakyCLIP requires pre-trained CLIP models. Set up the model directory:
```bash
mkdir -p ./models
```
⚠️ Important: Before running experiments, modify `paths.py` to match your system paths:

```python
# In leakyclip_release/paths.py
DEFAULT_MODEL_ROOT = os.environ.get(
    "LEAKYCLIP_MODEL_ROOT",
    "/your/path/to/models",  # <-- Update this
)
DEFAULT_DATA_ROOT = os.environ.get(
    "LEAKYCLIP_DATA_ROOT",
    "/your/path/to/data",  # <-- Update this
)
```
Required Models:
| Model | Architecture | HuggingFace ID | Purpose |
|---|---|---|---|
| ViT-B/16 | ViT-B-16 | laion/CLIP-ViT-B-16-laion2B-s34B-b88K | Main inversion model |
| ViT-B/32 | ViT-B-32 | laion/CLIP-ViT-B-32-laion2B-s34B-b79K | Alternative architecture |
| ViT-L/14 | ViT-L-14 | laion/CLIP-ViT-L-14-laion2B-s32B-b82K | Large model variant |
| ConvNeXt-Base | ConvNeXt | laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K | CLIP Score metric |
| Robust ViT-B/16 | ViT-B-16 | chs20/FARE4-ViT-B-16-laion2B-s34B-b88K | FARE adversarial fine-tuned |
| Robust ViT-B/32 | ViT-B-32 | chs20/FARE4-ViT-B-32-laion2B-s34B-b79K | FARE adversarial fine-tuned |
| Robust ViT-L/14 | ViT-L-14 | Erdos2568/Robust_CLIP (eps8/20000.pt) | Robust CLIP (eps=8) |
Download via HuggingFace:
```bash
# ViT-B/16 (standard)
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-16', pretrained='laion2b_s34b_b88k')"

# Robust ViT-B/16 (FARE fine-tuned)
python -c "import open_clip; open_clip.create_model_and_transforms('ViT-B-16', pretrained='chs20/FARE4-ViT-B-16-laion2B-s34B-b88K')"
```

Or download manually:
```bash
# Using huggingface-cli
huggingface-cli download laion/CLIP-ViT-B-16-laion2B-s34B-b88K --local-dir ./models/vit-b-16
huggingface-cli download chs20/FARE4-ViT-B-16-laion2B-s34B-b88K --local-dir ./models/vit-b-16-robust
huggingface-cli download laion/CLIP-convnext_base_w_320-laion_aesthetic-s13B-b82K --local-dir ./models/convnext-base
```

Additional Robust Models:
| Model | HuggingFace URL | Description |
|---|---|---|
| Robust ViT-B/32 | chs20/FARE4-ViT-B-32-laion2B-s34B-b79K | FARE adversarial fine-tuned ViT-B/32 |
| Robust ViT-L/14 | Erdos2568/Robust_CLIP | Robust CLIP ViT-L/14 (eps=8) |
Download Robust Models:
```bash
# Robust ViT-B/32 (FARE fine-tuned)
huggingface-cli download chs20/FARE4-ViT-B-32-laion2B-s34B-b79K --local-dir ./models/vit-b-32-robust

# Robust ViT-L/14 (eps=8)
wget https://huggingface.co/Erdos2568/Robust_CLIP/resolve/main/eps8/20000.pt \
  -O ./models/vit-l-14-robust.pt
```

Stable Diffusion (for refinement):

```bash
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 --local-dir ./models/sdxl-base-1.0
huggingface-cli download madebyollin/sdxl-vae-fp16-fix --local-dir ./models/sdxl-vae
```

SSCD (Self-Supervised Copy Detection):
```bash
# Download SSCD weights
wget https://dl.fbaipublicfiles.com/sscd-copy-detection/sscd_disc_mixup.torchvision.pt \
  -O ./models/sscd_disc_mixup.torchvision.pt
```

LAION-HD Subset (Recommended):
```bash
# Download curated high-quality subset (~1M samples)
python -c "
from datasets import load_dataset
dataset = load_dataset('yuvalkirstain/laion-hd-subset', split='train')
# Save to disk...
dataset.save_to_disk('./HF_dataset/laion')
"
```

Furniture Object Dataset:
```bash
# Download furniture object dataset (~10K samples)
python -c "
from datasets import load_dataset
dataset = load_dataset('abrarlohia/sample_furniture_object', split='train')
# Save to disk...
dataset.save_to_disk('./HF_dataset/furniture_object')
"
```

Supported Datasets:

- `laion` - LAION-2B subset (main evaluation)
- `flickr` - Flickr30k (caption diversity)
- `furniture_object` - Furniture objects (structured content)
- `lfw` - Labeled Faces in the Wild (privacy evaluation)
HuggingFace Dataset Sources:
| Dataset | HuggingFace ID | Description |
|---|---|---|
| LAION-HD Subset | yuvalkirstain/laion-hd-subset | Curated high-quality subset (recommended) |
| Furniture Objects | abrarlohia/sample_furniture_object | Structured furniture images |
Note: We recommend using `yuvalkirstain/laion-hd-subset` instead of the full LAION-2B-en dataset (~2.3 billion images) for faster experimentation.
If you want to train your own embedding alignment matrix:
```bash
python -m leakyclip_release.ea_train \
  --model-name ViT-B-16_robust_fair_4 \
  --dataset laion \
  --num-samples 2000 \
  --batch-size 256 \
  --output-dir ./text2image_embedding
```

This learns the linear transformation matrix M that maps text embeddings to pseudo-image embeddings.
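Conceptually, EA fits a linear regression from text embeddings to image embeddings. A minimal numpy sketch on synthetic data, using the closed-form least-squares solution (the actual training script may optimize M differently, e.g. with mini-batch gradient descent on normalized embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8       # toy embedding dimension (CLIP embeddings are e.g. 512-dim)
n = 200     # number of caption/image pairs

# Synthetic paired embeddings: image embeddings are a hidden linear map of text ones.
T = rng.normal(size=(n, d))                        # text embeddings, one per row
M_true = rng.normal(size=(d, d))
img = T @ M_true + 0.01 * rng.normal(size=(n, d))  # slightly noisy image embeddings

# Fit M by least squares: argmin_M ||T M - img||_F^2
M, *_ = np.linalg.lstsq(T, img, rcond=None)

pseudo = T @ M                                     # pseudo-image embeddings
err = np.linalg.norm(pseudo - img) / np.linalg.norm(img)
print(err < 0.05)  # True: alignment recovers the map up to the noise level
```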
```bash
python -m leakyclip_release.main \
  --config ./leakyclip_release/configs/method/single.json \
  --text "a mid-century modern beige armchair with wooden tapered legs" \
  --output ./out/inversion.png \
  --compute-metrics
```

Example Output:
Figure: Example reconstruction from text prompt using LeakyCLIP with VAE latent space inversion and Stable Diffusion refinement.
```bash
python -m leakyclip_release.main \
  --config ./leakyclip_release/configs/method/laion.json \
  --dataset-name laion \
  --output-dir ./out/laion_results \
  --max-samples 1000 \
  --compute-metrics
```

LeakyCLIP provides a Python API for programmatic access:
```python
from leakyclip_release.models import create_model
from leakyclip_release.inversion import CLIPInverter, InversionConfig
from leakyclip_release.refinement import SDRefiner, SDRefineConfig
import torch

# Setup device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP model
model, tokenizer, _, _ = create_model("ViT-B-16", device=torch.device(device))

# Create inverter with custom config
inverter = CLIPInverter(
    model,
    tokenizer,
    device=device,
    config=InversionConfig(
        num_steps=500,
        lr=0.03,
        num_views=32,
        image_size=1024,
        inversion_space="vae",  # VAE latent space for better fidelity
        vae_model_id="./models/sdxl-vae",
    ),
)

# Optional: Add Stable Diffusion refinement
refiner = SDRefiner(
    SDRefineConfig(
        model_id="./models/sdxl-base-1.0",
        strength=0.3,
        num_inference_steps=50,
    )
)

# Perform inversion from text
prompt = (
    "A sleek contemporary living room with a gray sectional sofa, "
    "glass coffee table, floor lamp with warm lighting, hardwood floor, "
    "large window with curtains, modern interior design, "
    "architectural digest style, high resolution"
)
result = inverter.invert_from_text(prompt, return_metrics=False)

# Apply refinement (optional)
refined_result = refiner.refine(result, prompt)

# Save result
refined_result.save("./out/inversion_sdk.png")
```

Inversion with metric evaluation:

```python
from PIL import Image
from leakyclip_release.eval import LPIPSMetric, SSIMMetric

# Load reference image
reference = Image.open("reference.png").convert("RGB")

# Build metrics
metrics = [
    LPIPSMetric(device=device),
    SSIMMetric(device=device),
]

# Invert with metrics evaluation
result, metric_values = inverter.invert_from_text(
    "a mid-century modern beige armchair",
    reference_image=reference,
    metrics=metrics,
    return_metrics=True,
)

print(f"LPIPS: {metric_values.get('lpips')}")
print(f"SSIM: {metric_values.get('ssim')}")
```

VAE latent space inversion:

```python
from leakyclip_release.inversion import InversionConfig

# Configure for VAE latent space
vae_config = InversionConfig(
    image_size=1024,
    inversion_space="vae",  # Use VAE latent space
    vae_model_id="./models/sdxl-vae",
    vae_dtype="fp16",
    vae_scaling_factor=0.13025,
    latent_l2_weight=0.001,  # L2 regularization in latent space
)

inverter = CLIPInverter(model, tokenizer, device=device, config=vae_config)
prompt = (
    "A sleek contemporary living room with a gray sectional sofa, "
    "glass coffee table, floor lamp with warm lighting, hardwood floor, "
    "large window with curtains, modern interior design, "
    "architectural digest style, high resolution"
)
result = inverter.invert_from_text(prompt)
```

Using a robust (adversarially fine-tuned) model:

```python
# Load adversarially fine-tuned (robust) model
robust_model, _, _, _ = create_model("ViT-B-16_robust_fair_4", device=torch.device(device))

# Use with embedding alignment
inverter = CLIPInverter(
    robust_model,
    tokenizer,
    device=device,
    config=InversionConfig(
        align_model_name="ViT-B-16_robust_fair_4",
        transpose_model_name="pinverse_model",
    ),
)
```

Project structure:

```
leakyclip_release/
├── main.py                  # Main entry point for inversion
├── ea_train.py              # Embedding alignment training
├── config.py                # Configuration management
├── data.py                  # Dataset loaders
├── models/
│   └── model_factory.py     # CLIP model factory (ViT, ConvNeXt)
├── inversion/
│   ├── inverter.py          # CLIP inversion logic
│   └── augmentations.py     # Data augmentation pipeline
├── eval/
│   ├── base.py              # Metric base classes
│   └── metrics.py           # SSIM, LPIPS, CS, SSCD metrics
├── refinement/
│   └── sd_refiner.py        # Stable Diffusion refinement
└── configs/                 # Configuration files
    ├── inversion/default.json
    ├── method/single.json
    ├── method/laion.json
    └── method/flickr.json
```
| Parameter | Default | Description |
|---|---|---|
| `num_steps` | 500 | Inversion optimization steps |
| `lr` | 0.03 | Learning rate for inversion |
| `num_views` | 32 | Number of augmented views |
| `image_size` | 1024 | Output image size |
| `tv_weight` | 0.05 | Total variation regularization |
| `patch_loss_weight` | 5.0 | Patch-based consistency loss |
| `thresholding` | 0.95 | Dynamic threshold quantile |
| `refine_strength` | 0.3-0.55 | SD img2img strength |
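For reference, `tv_weight` scales a total variation penalty that discourages high-frequency noise in the reconstruction. A minimal numpy sketch of the standard anisotropic TV term (the repository's exact formulation may differ):

```python
import numpy as np

def total_variation(img: np.ndarray) -> float:
    """Anisotropic total variation of an H x W image:
    sum of absolute differences between neighboring pixels."""
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbors
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbors
    return float(dh + dw)

flat = np.ones((8, 8))
noisy = flat + np.random.default_rng(0).normal(0, 0.1, (8, 8))
print(total_variation(flat))             # 0.0: a constant image has zero TV
tv_term = 0.05 * total_variation(noisy)  # weighted as tv_weight * TV in the loss
```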
| Variable | Description | Default |
|---|---|---|
| `LEAKYCLIP_MODEL_ROOT` | Model checkpoint directory | ./models |
| `LEAKYCLIP_DATA_ROOT` | Dataset cache directory | ./HF_dataset |
| `LEAKYCLIP_ALIGN_ROOT` | Embedding alignment weights | ./text2image_embedding |
| `LEAKYCLIP_SSCD_WEIGHTS` | SSCD model weights | ./models/sscd_*.pt |
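These can be exported in the shell before launching any script; the paths below are illustrative, not defaults:

```shell
export LEAKYCLIP_MODEL_ROOT=/data/leakyclip/models
export LEAKYCLIP_DATA_ROOT=/data/leakyclip/HF_dataset
export LEAKYCLIP_ALIGN_ROOT=/data/leakyclip/text2image_embedding
export LEAKYCLIP_SSCD_WEIGHTS=/data/leakyclip/models/sscd_disc_mixup.torchvision.pt
```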
Learn the linear transformation matrix M for embedding alignment:
```bash
python -m leakyclip_release.ea_train \
  --model-name ViT-B-16_robust_fair_4 \
  --dataset laion \
  --batch-size 256 \
  --output-dir ./text2image_embedding
```

Supported datasets: `laion`, `flickr`, `furniture_object`
We adopt four complementary metrics for comprehensive evaluation:
- SSIM [48]: Structural similarity [-1, 1], higher is better
- LPIPS [57]: Perceptual similarity [0, ∞), lower is better
- CLIP Score (CS): Cosine similarity using ConvNeXt-Base [-1, 1]
- SSCD [36]: Self-supervised copy detection [-1, 1]
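CLIP Score is the cosine similarity between an image embedding and a text embedding. A minimal numpy sketch with toy vectors (in the repository the embeddings come from ConvNeXt-Base):

```python
import numpy as np

def clip_score(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a text embedding."""
    img_emb = img_emb / np.linalg.norm(img_emb)
    txt_emb = txt_emb / np.linalg.norm(txt_emb)
    return float(np.dot(img_emb, txt_emb))

a = np.array([1.0, 0.0, 1.0])
print(clip_score(a, a))                          # 1.0: identical directions
print(clip_score(a, np.array([0.0, 1.0, 0.0])))  # 0.0: orthogonal directions
```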
Attacker Capabilities:
- White-box access to CLIP parameters
- Access to exact training captions paired with target images
- Standard assumptions for rigorous privacy vulnerability assessment
Attack Goal: Reconstruct training images from text prompts via model inversion.
If you use this code in your research, please cite:
```bibtex
@article{chen2025leakyclip,
  title={LeakyCLIP: Extracting Training Data from CLIP},
  author={Chen, Yunhao and Wang, Shujie and Wang, Xin and Ma, Xingjun},
  journal={arXiv preprint arXiv:2508.00756},
  year={2025}
}
```

Key papers and methods used in this work:
- [37] Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP)
- [39] Wang et al. "Fine-tuning for Adversarially Robust Embeddings" (FARE)
- [38] Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion)
- [48] Wang et al. "Image Quality Assessment: From Error Visibility to Structural Similarity" (SSIM)
- [57] Zhang et al. "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric" (LPIPS)
- [36] Pizzi et al. "A Self-Supervised Descriptor for Image Copy Detection" (SSCD)
This code is provided for research purposes only to understand and mitigate privacy risks in multimodal models. The authors are not responsible for any misuse of this code for unauthorized data extraction or privacy violations.
For questions or issues, please contact:
- Xingjun Ma (xingjunma@fudan.edu.cn)
- Open an issue on GitHub
MIT License

