Skip to content

M-HuangX/LAION-Beyond

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Few-Shot Name Learning & LAION-Beyond Benchmark

This is the official code release for our CVPR 2025 paper:

Reproducible Vision-Language Models Meet Concepts Out of Pre-Training
CVPR 2025
[Paper] | [Project Page] | [HuggingFace Dataset] | [Code]


Overview

FSNL (Few-Shot Name Learning) is a prompt-learning method for CLIP/OpenCLIP that learns per-class name embeddings directly instead of shared context vectors (as in CoOp). Given only a handful of labeled images per class, FSNL optimizes a small set of token-level embeddings that replace each class name in the text prompt, while keeping all CLIP weights frozen.

We also introduce the LAION-Beyond benchmark — a collection of nine fine-grained datasets covering concepts absent from LAION-400M pre-training data — to evaluate out-of-pretraining (OOP) generalization.

IP vs OOP generalization comparison
Figure 1: Comparison between IP and OOP generalization. The former evaluates OpenCLIP's generalization with visual concepts seen in pre-training phases, whereas the latter justifies its generalization through the concepts absent during pre-training.


Key Findings

1. Strong image feature representation for OOP concepts. OpenCLIP's image encoder forms well-separated clusters for OOP concepts (clustering accuracy gap < 3% on most domains vs. IP concepts).

t-SNE visualization
Figure 3: t-SNE visualization of image features for OOP (Plants & Fungi) and mixed OOP/IP classes.

2. Image-text alignment failure. Despite strong image features, zero-shot transfer on OOP concepts fails significantly — the token embeddings for OOP class names were never aligned with visual features during pre-training. This gap persists even as pre-training data scales from 400M to 5B.

Zero-shot accuracy OOP vs IP
Figure 4a: OpenCLIP's zero-shot accuracy on OOP vs. IP classes in LAION-Beyond.

3. Name-tuning is the key. Our FSNL and ZSNL algorithms, which fine-tune only the name (token) embeddings of OOP concepts, efficiently restore OOP generalization without degrading IP performance.


The LAION-Beyond Benchmark

LAION-Beyond distribution
Figure 2a: Statistics of OOP and IP concepts and images in LAION-Beyond (400M), (2B), and (5B).

LAION-Beyond is a multi-domain benchmark for evaluating OOP concept generalization of vision-language models.

Split Images Concepts
OOP 106,052 674
IP 51,330 324
Total 157,382 998

Included datasets: Pokemon, Animals, Architecture50_23, Attire54_28, FolkArt59_27, Food53_27, Insects_Spiders106_52, Landmark59_30, Plants_Fugi113_56

Each dataset comes with *_OOP (Out-of-Pretraining) and *_IP (In-Pretraining) subdirectories. Download and place them under the same root directory. The dataset loader auto-discovers the correct subdirectory via glob pattern matching.

Each dataset directory should contain:

<DatasetName>_OOP/
    images/
        <classname>/
            *.jpg
    split_Xin_<name>.json
<DatasetName>_IP/
    images/
        ...
    split_Xin_<name>.json

Dataset: [HuggingFace]


Experimental Results

OOP Few-Shot Learning (4-shot, H-mean of OOP & IP accuracy)

Few-shot comparison
Figure 5: OOP few-shot learning performance (1,2,4,8,16 shots) of different methods across domains in LAION-Beyond (400M).

Method Animals Architecture Attire FolkArt Food Insects Landmark Plants Pokemon Avg
OpenCLIP 26.75 30.75 25.88 35.04 15.36 22.38 40.25 21.43 24.48 26.92
CoOp 31.37 57.80 50.39 52.06 42.55 25.73 85.89 24.78 35.52 45.12
CLIP-Adapter 38.98 59.27 64.56 56.32 64.32 32.51 90.82 31.97 54.99 54.86
FSNL (ours) 46.17 62.63 71.65 63.03 70.00 44.03 94.48 44.12 68.87 62.55

Performance Across Model Scales

FSNL across model scales
Figure 6: FSNL performance under neural scaling law. Light circles = zero-shot baselines; dark circles = after FSNL tuning.

FSNL demonstrates consistent improvements across different model scales (ViT-B/16, ViT-L/14) and CLIP variants (OpenAI CLIP, OpenCLIP, EVA-CLIP 2B).


Installation

1. Clone the repository

git clone https://github.com/M-HuangX/LAION-Beyond.git
cd FSNL

2. Install dependencies

pip install torch torchvision
pip install open_clip_torch ftfy regex tqdm scikit-learn

Install Dassl (modified version included at ../Dassl.pytorch/):

cd ../Dassl.pytorch
pip install -e .
cd ../FSNL

Dataset Download

Download the full LAION-Beyond dataset from HuggingFace (~15GB):

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond"   # use this path as --root in experiments
)

Or download a single domain only:

local_dir = snapshot_download(
    repo_id="MHuangX/LAION-Beyond",
    repo_type="dataset",
    local_dir="./LAION_Beyond",
    allow_patterns="Animals42_IP/**"
)

Requires pip install huggingface_hub. Then pass the download path as --root when running experiments.


Running Experiments

All experiments go through train.py.

Training FSNL

python train.py \
  --root /path/to/datasets \
  --seed 1 \
  --trainer FSNL_openclip \
  --dataset-config-file configs/datasets/Animals92_42.yaml \
  --config-file configs/trainers/FSNL_openclip/vit_b16_ep100.yaml \
  --output-dir output/fsnl_animals_4shot \
  TRAINER.FSNL.FLE False \
  TRAINER.FSNL.N_CTX 4 \
  TRAINER.FSNL.CIFC True \
  TRAINER.FSNL.USE_CAPTION True \
  DATASET.NUM_SHOTS 4 \
  DATASET.SUB_CLASSES OOP

Evaluation Only

python train.py \
  --root /path/to/datasets \
  --seed 1 \
  --trainer FSNL_openclip \
  --dataset-config-file configs/datasets/Animals92_42.yaml \
  --config-file configs/trainers/FSNL_openclip/vit_b16_ep100.yaml \
  --output-dir output/eval \
  --model-dir output/fsnl_animals_4shot \
  --load-epoch 100 \
  --eval-only \
  DATASET.NUM_SHOTS 4 \
  DATASET.SUB_CLASSES OOP

Linear Probe Baseline

# Step 1: Extract CLIP features
python lpclip/feat_extractor.py --dataset Animals92_42 --root /path/to/datasets

# Step 2: Run linear probe
python lpclip/linear_probe.py --dataset Animals92_42 --feature_dir clip_feat

Key Configuration Flags

Flag Default Description
TRAINER.FSNL.FLE False Fixed-length embedding. False = match original BPE token count per class
TRAINER.FSNL.N_CTX 4 Token count when FLE=True
TRAINER.FSNL.CIFC True Initialize from CLIP token embeddings (True) or random (False)
TRAINER.FSNL.USE_CAPTION True Enable caption-based contrastive loss alongside the base loss
DATASET.NUM_SHOTS Number of training images per class (e.g., 1, 2, 4, 8, 16)
DATASET.SUB_CLASSES OOP Which split to use: OOP, IP, or All

Available Trainers

Trainer name Description
FSNL_openclip Ours: per-class name embedding learning with OpenCLIP ViT-B/16
FSNL_openclip_2B FSNL with EVA-CLIP 2B backbone
FSNL_openclip_400M_L14 FSNL with OpenCLIP ViT-L/14 (400M)
CoOp_openclip CoOp baseline with OpenCLIP
CoCoOp_openclip CoCoOp baseline with OpenCLIP
ZeroshotCLIP2_openclip Zero-shot evaluation baseline
CoOp / CoCoOp Original CoOp/CoCoOp with OpenAI CLIP

GCD Experiments (Supplementary)

Generalized Category Discovery (GCD) experiments from the supplementary material use FSNL_openclip_GCD trainer and the LAION_Beyond_GCD dataset. Scripts are in scripts/FSNL_openclip_GCD/.


Citation

If you use this code or the LAION-Beyond benchmark in your research, please cite:

@inproceedings{chen2025reproducible,
  title={Reproducible vision-language models meet concepts out of pre-training},
  author={Chen, Ziliang and Huang, Xin and Fan, Xiaoxuan and Wang, Keze and Zhou, Yuyu and Guan, Quanlong and Lin, Liang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14701--14711},
  year={2025}
}

Acknowledgements

This codebase is built on Dassl and CoOp. We thank the authors for their open-source contributions.

About

Reproducible Vision-Language Models Meet Concepts Out of Pre-Training (CVPR2025)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors