Hila Chefer* · Patrick Esser*
Dominik Lorenz · Dustin Podell · Vikash Raja · Vinh Tong · Antonio Torralba · Robin Rombach
Black Forest Labs
This folder contains inference code for generating images with our Self-Flow trained diffusion model on ImageNet 256×256.
Self-Flow (Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis) is a training framework that combines the flow matching objective with a self-supervised feature reconstruction objective.
This inference code allows you to:
- Load a Self-Flow checkpoint (pretrained on ImageNet 256×256)
- Generate 50,000 images for FID evaluation
The generated samples can be evaluated using the ADM evaluation suite.
pip install -r requirements.txt

python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='Hila/Self-Flow',
filename='selfflow_imagenet256.pt',
local_dir='./checkpoints'
)
print('Downloaded!')
"
torchrun --nnodes=1 --nproc_per_node=8 sample.py \
--ckpt checkpoints/selfflow_imagenet256.pt \
--output-dir ./samples \
--num-fid-samples 50000

python sample.py \
--ckpt checkpoints/selfflow_imagenet256.pt \
--output-dir ./samples \
--num-fid-samples 50000 \
--batch-size 64

| Argument | Default | Description |
|---|---|---|
| `--ckpt` | required | Path to model checkpoint |
| `--output-dir` | `./samples` | Output directory for generated samples |
| `--num-fid-samples` | `50000` | Number of samples to generate |
| `--batch-size` | `64` | Batch size per GPU |
| `--num-steps` | `250` | Number of diffusion sampling steps |
| `--mode` | `SDE` | Sampling mode: `SDE` or `ODE` |
| `--seed` | `31` | Random seed for reproducibility |
| `--cfg-scale` | `1.0` | Classifier-free guidance scale (1.0 = no guidance, as used in paper) |
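The arguments above could be parsed roughly as in the sketch below. This is a hypothetical parser that only mirrors the documented flags and defaults; the actual `sample.py` may be organized differently. The helper `iterations_per_gpu` is also an assumption: it illustrates one common way to split a sample budget across GPUs, where every GPU runs the same number of full batches and any surplus images are discarded.

```python
import argparse
import math


def build_parser():
    # Hypothetical parser mirroring the documented flags; the real sample.py may differ.
    parser = argparse.ArgumentParser(description="Self-Flow sampling (sketch)")
    parser.add_argument("--ckpt", required=True, help="Path to model checkpoint")
    parser.add_argument("--output-dir", default="./samples",
                        help="Output directory for generated samples")
    parser.add_argument("--num-fid-samples", type=int, default=50000,
                        help="Number of samples to generate")
    parser.add_argument("--batch-size", type=int, default=64,
                        help="Batch size per GPU")
    parser.add_argument("--num-steps", type=int, default=250,
                        help="Number of diffusion sampling steps")
    parser.add_argument("--mode", choices=["SDE", "ODE"], default="SDE",
                        help="Sampling mode")
    parser.add_argument("--seed", type=int, default=31,
                        help="Random seed for reproducibility")
    parser.add_argument("--cfg-scale", type=float, default=1.0,
                        help="Classifier-free guidance scale (1.0 = no guidance)")
    return parser


def iterations_per_gpu(num_samples, batch_size, world_size):
    # Assumption: each GPU runs whole batches only, so the total is rounded up
    # to a multiple of the global batch size (batch_size * world_size).
    global_batch = batch_size * world_size
    return math.ceil(num_samples / global_batch)


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--ckpt", "checkpoints/selfflow_imagenet256.pt"])
iters = iterations_per_gpu(args.num_fid_samples, args.batch_size, world_size=8)
print(args.mode, args.num_steps, iters)
```

Under these assumptions, the defaults on 8 GPUs give 98 iterations per GPU (50,176 images, of which 50,000 would be kept), and a single GPU needs 782 iterations.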
The generated .npz file can be used with the ADM evaluation suite to compute FID, IS, Precision, and Recall.
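Before running the evaluation it can help to confirm that the sample file has the expected layout. The sketch below assumes the samples are stored under NumPy's default key `arr_0` as an `N×256×256×3` `uint8` array, which is the layout the ADM evaluator is understood to read; this is an assumption, not something this README specifies. The function `npz_array_info` is a hypothetical helper that reads only the `.npy` header inside the archive, so the full array is never loaded into memory.

```python
import ast
import struct
import zipfile


def npz_array_info(npz_path, key="arr_0"):
    """Return (shape, dtype string) of one array in a .npz without loading its data."""
    with zipfile.ZipFile(npz_path) as archive:
        # A .npz file is a zip archive holding one .npy file per stored array.
        with archive.open(key + ".npy") as member:
            if member.read(6) != b"\x93NUMPY":
                raise ValueError("not a .npy member")
            major, _minor = member.read(2)
            # Format 1.x stores the header length in 2 bytes, later versions in 4.
            if major == 1:
                (header_len,) = struct.unpack("<H", member.read(2))
            else:
                (header_len,) = struct.unpack("<I", member.read(4))
            # The header is a Python dict literal with 'descr', 'fortran_order', 'shape'.
            header = ast.literal_eval(member.read(header_len).decode("latin1"))
    return header["shape"], header["descr"]
```

For example, `npz_array_info("./samples/samples_50000.npz")` would be expected to report a shape of `(50000, 256, 256, 3)` and a dtype of `|u1` if the file follows the assumed layout.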
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz

python evaluator.py \
VIRTUAL_imagenet256_labeled.npz \
./samples/samples_50000.npz ./samples

The Self-Flow model is based on SiT-XL/2.
A key architectural modification is per-token timestep conditioning, which allows each token to have a different noise level during training.
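To make the idea of per-token noise levels concrete, here is a minimal sketch in plain Python. It assumes a linear interpolant `x_t = (1 - t) * x + t * eps` with `t = 1` meaning pure noise; the convention Self-Flow actually uses is not stated here, and the function names are hypothetical. The point is only that `timesteps` holds one value per token rather than one value per image.

```python
import random


def interpolate_per_token(tokens, noise, timesteps):
    """Mix each token with its own noise vector at its own timestep.

    tokens, noise: lists of equal-length float vectors, one per token.
    timesteps: one float in [0, 1] per token (assumed: 0 = clean, 1 = pure noise).
    """
    noisy = []
    for tok, eps, t in zip(tokens, noise, timesteps):
        noisy.append([(1.0 - t) * x + t * e for x, e in zip(tok, eps)])
    return noisy


def sample_per_token_inputs(tokens, rng):
    # Draw an independent timestep and Gaussian noise vector for every token.
    timesteps = [rng.random() for _ in tokens]
    noise = [[rng.gauss(0.0, 1.0) for _ in tok] for tok in tokens]
    return noise, timesteps


rng = random.Random(0)
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noise, timesteps = sample_per_token_inputs(tokens, rng)
noisy_tokens = interpolate_per_token(tokens, noise, timesteps)
```

A model with per-token conditioning would then receive `noisy_tokens` together with the full `timesteps` list, instead of a single scalar timestep for the whole image.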
Self-Flow/
├── sample.py # Main sampling script
├── checkpoints/ # Place model checkpoints here
├── requirements.txt # Python dependencies
├── README.md # This file
└── src/ # Model and sampling implementations
    ├── model.py # SelfFlowPerTokenDiT model
    ├── sampling.py # Diffusion sampling utilities
    └── utils.py # Position encoding utilities
The model was trained using the following configuration:
- Model: SiT-XL/2 with per-token timestep conditioning
- Training: Self-Flow with per-token masking (25% mask ratio)
- Optimizer: AdamW with gradient clipping (max_norm=1)
- Mixed precision: BFloat16
- Self-distillation: Teacher at layer 20 (EMA), student at layer 8
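Two items in this configuration can be illustrated with a short sketch: choosing a random 25% subset of tokens, and updating an EMA teacher from the student. Both functions are hypothetical and written in plain Python for clarity. The token count of 256 assumes a 32×32 latent with patch size 2, as in standard SiT-XL/2 at 256×256, and the decay value is a placeholder; neither is taken from the training setup. What is done with the selected tokens is not specified here, so the sketch stops at producing the mask.

```python
import random


def sample_token_mask(num_tokens, mask_ratio, rng):
    """Return a boolean list with round(num_tokens * mask_ratio) entries set to True."""
    num_masked = round(num_tokens * mask_ratio)
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def ema_update(teacher, student, decay):
    """Move each teacher weight toward the student weight: decay * t + (1 - decay) * s."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


rng = random.Random(0)
mask = sample_token_mask(num_tokens=256, mask_ratio=0.25, rng=rng)  # assumed token count
teacher = ema_update(teacher=[1.0, 2.0], student=[0.0, 0.0], decay=0.5)  # placeholder decay
print(sum(mask), teacher)
```

With 256 tokens and a 25% ratio, 64 tokens are selected per image. In practice the EMA decay is usually much closer to 1 than the value used here.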
This code builds upon:
If you use this work, please cite:
@article{CheferEsser2026selfflow,
  title={Self-Supervised Flow Matching for Scalable Multi-Modal Synthesis},
  author={Hila Chefer and Patrick Esser and Dominik Lorenz and Dustin Podell and Vikash Raja and Vinh Tong and Antonio Torralba and Robin Rombach},
  journal={arXiv preprint arXiv:2603.06507},
  year={2026},
}