In my previous posts, we've explored how models like CLIP can encode semantic meaning in images and how contrastive learning can help us learn features without labels. Today, we're diving into one of the most exciting developments in generative AI: Diffusion Models.
Models like Stable Diffusion and DALL-E seem like magic that turns random static into detailed art. But under the hood, they rely on a beautifully simple idea grounded in probability and physics.
In this post, we will demystify diffusion by building a minimal, reproducible example in PyTorch. Instead of generating complex faces or landscapes, we will tackle a 2D toy problem: learning a Checkerboard distribution. This allows us to visualize exactly what the model is doing at every single step.
Our objective is to train a neural network that takes pure Gaussian noise (random points) and progressively "denoises" them until they form a structured checkerboard pattern.
Here is the transformation we want to achieve:
(Left: The real data distribution. Right: The model's output, reconstructed from pure noise.)
And here is the full process in action. Watch as the points, initially scattered in chaos, slowly drift into the structured manifold of the checkerboard:
How do we get here? Let's break down the math and the code.
A diffusion model consists of two processes:
- Forward Process (Diffusion): We gradually destroy structure by adding noise.
- Reverse Process (Generation): We train a model to restore structure by removing noise.
Given a data point $x_0 \sim q(x_0)$, the forward process gradually adds Gaussian noise over $T$ timesteps, producing increasingly noisy versions $x_1, x_2, \dots, x_T$.

The transition is defined as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)$$

where $\beta_t$ is a small, predefined variance at step $t$. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ gives us a convenient shorthand for how much signal survives after $t$ steps.

Crucially, we don't need to step through the chain iteratively to train. Thanks to the properties of Gaussians, we can sample $x_t$ directly from $x_0$ in closed form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
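To make this concrete, here is a small sketch of jumping straight to step $t$. The linear schedule and its endpoints (`1e-4` to `0.02`) are assumptions on my part, matching common DDPM defaults; `n_steps=100` matches the training setup used later in this post:

```python
import torch

n_steps = 100
# Linear variance schedule (assumed values, as in the DDPM paper)
betas = torch.linspace(1e-4, 0.02, n_steps)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # alpha_bar[t] = prod of alphas[:t+1]

x0 = torch.tensor([[0.5, -0.5]])  # one clean 2D data point
t = 50
epsilon = torch.randn_like(x0)

# Sample x_t in one shot -- no need to simulate t individual noising steps
xt = torch.sqrt(alpha_bar[t]) * x0 + torch.sqrt(1 - alpha_bar[t]) * epsilon
```

Note how `alpha_bar` decays from nearly 1 (almost no noise) toward 0 (pure noise) as $t$ grows, which is exactly why $x_T$ is indistinguishable from a Gaussian sample.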
The magic happens in the reverse direction. If we could sample from $q(x_{t-1} \mid x_t)$, we could start from pure noise and walk backwards to a clean sample. This true reverse distribution is intractable, so we approximate it with a neural network $p_\theta(x_{t-1} \mid x_t)$.

In practice, following the DDPM (Denoising Diffusion Probabilistic Models) paper, we parameterize our model to predict the noise $\epsilon$ that was added, rather than predicting $x_{t-1}$ directly.

We train our network $\epsilon_\theta(x_t, t)$ with a simple mean-squared-error loss:

$$\mathcal{L} = \mathbb{E}_{x_0,\, t,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2 \right]$$
This is the beauty of it: we are simply training a function approximator to "guess the noise."
We'll build this using PyTorch. The full code is available in the repo, but here are the core components.
We generate 2D points from a checkerboard pattern. This gives us a clear ground truth to visually verify our results.
```python
import numpy as np
import torch

def generate_checkerboard_data(n_samples=10_000):
    x1 = np.random.uniform(0, 4, n_samples)
    x2 = np.random.uniform(0, 4, n_samples)
    # Keep only points in cells (i, j) where i + j is even
    mask = (np.floor(x1) % 2 + np.floor(x2) % 2) % 2 == 0
    data = np.vstack([x1[mask], x2[mask]]).T
    return torch.tensor((data - 2) / 2, dtype=torch.float32)  # Normalize to [-1, 1]
```

Since our data is 2D, we don't need a complex U-Net or a Transformer. A simple Multi-Layer Perceptron (MLP) with the timestep $t$ appended to its input is all we need:
```python
import torch
import torch.nn as nn

class SimpleDiffusionModel(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1, hidden_dim),  # Input: x, y, and time t
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # Output: predicted noise (epsilon)
        )

    def forward(self, x, t):
        x_t = torch.cat([x, t], dim=1)  # Append the timestep to the coordinates
        return self.net(x_t)
```

The training loop directly implements the math we discussed:
- Sample real data $x_0$.
- Sample a random timestep $t$.
- Sample random noise $\epsilon$.
- Compute the noisy input $x_t$ (using the closed-form Gaussian property from above).
- Predict the noise with the model and backpropagate the MSE loss.
```python
# ... inside the training loop ...
# (model, optimizer, dataloader, n_steps, alpha_bar, and device
#  are defined in the setup code)
for x0 in dataloader:
    x0 = x0.to(device)
    t = torch.randint(0, n_steps, (x0.shape[0],)).long().to(device)
    epsilon = torch.randn_like(x0)

    # Create noisy sample x_t
    a_bar_t = alpha_bar[t].unsqueeze(1)
    xt = torch.sqrt(a_bar_t) * x0 + torch.sqrt(1 - a_bar_t) * epsilon

    # Predict the noise
    t_input = (t / n_steps).unsqueeze(1).float()
    predicted_epsilon = model(xt, t_input)

    # Optimize
    loss = nn.MSELoss()(predicted_epsilon, epsilon)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

I've structured the code so you can train the model and visualize the results separately. This is great for tweaking the visualization without re-running the expensive training loop.
1. Train the Model
First, train the model for 5000 epochs (which takes ~10 minutes on a GPU) and save the checkpoint:
```shell
python main.py --mode train --epochs 5000 --checkpoint my_model.pth
```

Output:

```
Training with 100 steps for 5000 epochs...
Using device: cuda
Starting training...
Epoch 0, Average Loss: 0.637849
...
Epoch 4500, Average Loss: 0.512942
Model saved to my_model.pth
```
2. Visualize the Results
Then, load the checkpoint to generate the static plot (diffusion_checkerboard.png) and the animation (diffusion_process.gif):
```shell
python main.py --mode visualize --checkpoint my_model.pth
```

Output:

```
Loading model from my_model.pth...
Sampling...
Saved plot to diffusion_checkerboard.png
Saved animation to diffusion_process.gif
```
It helps to think of the reverse process as iterative refinement.
At step $t = T$, the model sees pure noise and can only make a coarse guess at the underlying structure; each subsequent step removes a little noise and nudges the points toward the data distribution.

Because the noise schedule $\beta_t$ shrinks as $t$ approaches $0$, the final steps make only small corrections, sharpening the fine details of the checkerboard rather than moving points across the plane.
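This refinement loop can be sketched in code using the standard DDPM update rule,

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sqrt{\beta_t}\, z, \qquad z \sim \mathcal{N}(0, \mathbf{I}).$$

The following is a minimal sketch, not the exact code from the repo: I assume the same linear schedule and the same normalized-timestep input convention as the training loop, and the function name `sample` is my own.

```python
import torch

@torch.no_grad()
def sample(model, n_points=1000, n_steps=100):
    # Recreate the (assumed) linear schedule used during training
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(n_points, 2)  # Start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        # Same normalized timestep encoding as in training
        t_input = torch.full((n_points, 1), t / n_steps)
        eps_pred = model(x, t_input)
        # DDPM mean: subtract the rescaled predicted noise
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps_pred) / torch.sqrt(alphas[t])
        if t > 0:
            # Re-inject a small amount of noise (skipped at the final step)
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x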
This toy example demonstrates the fundamental mechanics that power state-of-the-art generative models. Whether you are generating 2D points or high-resolution images, the recipe is the same: corrupt data with a known noise process, train a network to predict that noise, and generate by iteratively denoising.
I hope this gives you a clearer intuition for how diffusion models operate!
If you found this useful, please check out the full reproducible code on my GitHub and give it a star if you feel so inclined.
