Pdbz199/simple-diffusion-model
From Noise to Structure: Building a Diffusion Model from Scratch

In my previous posts, we've explored how models like CLIP can encode semantic meaning in images and how contrastive learning can help us learn features without labels. Today, we're diving into one of the most exciting developments in generative AI: Diffusion Models.

Models like Stable Diffusion and DALL-E can seem like magic, turning random static into detailed art. But under the hood, they rely on a beautifully simple idea grounded in probability and physics.

In this post, we will demystify diffusion by building a minimal, reproducible example in PyTorch. Instead of generating complex faces or landscapes, we will tackle a 2D toy problem: learning a Checkerboard distribution. This allows us to visualize exactly what the model is doing at every single step.

The Goal: Order from Chaos

Our objective is to train a neural network that takes pure Gaussian noise (random points) and progressively "denoises" them until they form a structured checkerboard pattern.

Here is the transformation we want to achieve:

Noise vs Generated (Left: The real data distribution. Right: The model's output, reconstructed from pure noise.)

And here is the full process in action. Watch as the points, initially scattered in chaos, slowly drift into the structured manifold of the checkerboard:

Diffusion Process

How do we get here? Let's break down the math and the code.


The Math: Forward and Reverse Processes

A diffusion model consists of two processes:

  1. Forward Process (Diffusion): We gradually destroy structure by adding noise.
  2. Reverse Process (Generation): We train a model to restore structure by removing noise.

1. The Forward Process ($q$)

Given a data point $x_0$ from our real distribution, we add small amounts of Gaussian noise over $T$ timesteps to produce a sequence $x_1, \dots, x_T$. As $T \to \infty$, $x_T$ becomes pure isotropic Gaussian noise.

The transition is defined as: $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$

where $\beta_t$ is our noise schedule.

Crucially, we don't need to iterate through every step to train. Thanks to the properties of Gaussians, we can sample $x_t$ at any arbitrary timestep $t$ directly from $x_0$: $$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$ where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$.
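This closed-form jump can be sketched in a few lines of PyTorch. The linear $\beta$ schedule here is an illustrative assumption (the repo may use a different schedule); the `q_sample` helper is just the equation above:

```python
import torch

n_steps = 100
# Linear beta schedule (an assumption for illustration)
beta = torch.linspace(1e-4, 0.02, n_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product of (1 - beta_s)

def q_sample(x0, t, epsilon):
    """Jump straight to x_t from x_0 using the closed-form expression."""
    a_bar_t = alpha_bar[t].unsqueeze(1)  # shape (batch, 1) to broadcast over 2D points
    return torch.sqrt(a_bar_t) * x0 + torch.sqrt(1 - a_bar_t) * epsilon

x0 = torch.randn(8, 2)               # a batch of 2D points
t = torch.randint(0, n_steps, (8,))  # an independent random timestep per point
xt = q_sample(x0, t, torch.randn_like(x0))
```

Note that `alpha_bar` decreases monotonically toward zero, so larger $t$ means the sample is dominated by noise.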

2. The Reverse Process ($p_\theta$)

The magic happens in the reverse direction. If we could sample from $q(x_{t-1} | x_t)$, we could turn noise back into data. Since we can't calculate this directly, we approximate it with a neural network:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

In practice, following the DDPM (Denoising Diffusion Probabilistic Models) paper, we parameterize our model to predict the noise $\epsilon$ that was added to $x_0$.
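In this parameterization the mean of the reverse transition is fully determined by the predicted noise:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)$$

where $\alpha_t = 1 - \beta_t$, and $\Sigma_\theta$ is typically fixed (e.g. to $\beta_t I$) rather than learned.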

3. The Objective

We train our network $\epsilon_\theta(x_t, t)$ to minimize the Mean Squared Error (MSE) between the actual noise $\epsilon$ and the predicted noise:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

This is the beauty of it: we are simply training a function approximator to "guess the noise."


The Implementation

We'll build this using PyTorch. The full code is available in the repo, but here are the core components.

1. The Data

We generate 2D points from a checkerboard pattern. This gives us a clear ground truth to visually verify our results.

import numpy as np
import torch

def generate_checkerboard_data(n_samples=10_000):
    x1 = np.random.uniform(0, 4, n_samples)
    x2 = np.random.uniform(0, 4, n_samples)
    # Keep points whose cell coordinates sum to an even parity (the "black" squares)
    mask = (np.floor(x1) % 2 + np.floor(x2) % 2) % 2 == 0
    data = np.vstack([x1[mask], x2[mask]]).T
    return torch.tensor((data - 2) / 2, dtype=torch.float32)  # Normalize [0, 4] -> [-1, 1]
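As a quick sanity check of the mask logic (a standalone re-derivation, not code from the repo): a point is kept exactly when the parities of its cell coordinates sum to an even number, which traces out a checkerboard.

```python
import numpy as np

def on_board(x1, x2):
    # Same parity test as in generate_checkerboard_data
    return (np.floor(x1) % 2 + np.floor(x2) % 2) % 2 == 0

assert on_board(0.5, 0.5)      # cell (0, 0) is kept
assert not on_board(1.5, 0.5)  # cell (1, 0) is dropped
assert on_board(1.5, 1.5)      # cell (1, 1) is kept
```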

2. The Model

Since our data is 2D, we don't need a complex U-Net or Transformer. A simple multi-layer perceptron (MLP) with the timestep $t$ appended as an extra input feature works perfectly.

import torch
import torch.nn as nn

class SimpleDiffusionModel(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1, hidden_dim),  # Input: x, y, and time t
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # Output: predicted noise (epsilon)
        )

    def forward(self, x, t):
        # Condition on time by appending t as an extra feature
        x_t = torch.cat([x, t], dim=1)
        return self.net(x_t)

3. The Training Loop

The training loop directly implements the math we discussed:

  1. Sample real data $x_0$.
  2. Sample a random timestep $t$.
  3. Sample random noise $\epsilon$.
  4. Compute the noisy input $x_t$ using the closed-form expression from the forward process.
  5. Predict the noise and backpropagate the MSE loss.

# ... inside the training loop ...
for x0 in dataloader:
    t = torch.randint(0, n_steps, (x0.shape[0],)).long().to(device)
    epsilon = torch.randn_like(x0)

    # Create noisy sample x_t
    a_bar_t = alpha_bar[t].unsqueeze(1)
    xt = torch.sqrt(a_bar_t) * x0 + torch.sqrt(1 - a_bar_t) * epsilon

    # Predict the noise
    t_input = (t / n_steps).unsqueeze(1).float()
    predicted_epsilon = model(xt, t_input)

    # Optimize
    loss = nn.MSELoss()(predicted_epsilon, epsilon)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Running the Code

I've structured the code so you can train the model and visualize the results separately. This is great for tweaking the visualization without re-running the expensive training loop.

1. Train the Model

First, train the model for 5000 epochs (which takes ~10 minutes on a GPU) and save the checkpoint:

python main.py --mode train --epochs 5000 --checkpoint my_model.pth

Output:

Training with 100 steps for 5000 epochs...
Using device: cuda
Starting training...
Epoch 0, Average Loss: 0.637849
...
Epoch 4500, Average Loss: 0.512942
Model saved to my_model.pth

2. Visualize the Results

Then, load the checkpoint to generate the static plot (diffusion_checkerboard.png) and the animation (diffusion_process.gif):

python main.py --mode visualize --checkpoint my_model.pth

Output:

Loading model from my_model.pth...
Sampling...
Saved plot to diffusion_checkerboard.png
Saved animation to diffusion_process.gif

Why Does This Work?

It helps to think of the reverse process as iterative refinement.

At the first reverse step ($t = T = 100$, pure noise), the model effectively asks: "If this random point was generated by adding noise to a checkerboard, in which direction should I push it to make it look slightly less noisy?"

Because each $\beta_t$ in the noise schedule is small, the model only needs to make a tiny adjustment at each step. It doesn't need to jump straight to the solution; it just needs to nudge the point slightly toward the data manifold. Over 100 small nudges, points that started as pure randomness collectively migrate to the target distribution.
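This nudging is exactly what the sampling loop performs. Below is a minimal sketch of DDPM ancestral sampling, assuming the same linear $\beta$ schedule as the earlier snippet and an $\epsilon$-predicting model with the `forward(x, t)` signature shown above; the repo's actual sampler may differ in details such as the variance choice:

```python
import torch

n_steps = 100
# Assumed linear beta schedule (for illustration)
beta = torch.linspace(1e-4, 0.02, n_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def sample(model, n_points=1000):
    """DDPM ancestral sampling: start from pure noise and denoise step by step."""
    x = torch.randn(n_points, 2)  # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        t_input = torch.full((n_points, 1), t / n_steps)
        eps = model(x, t_input)   # predicted noise at this step
        # Posterior mean: one small nudge toward the data manifold
        x = (x - beta[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
        if t > 0:                 # add fresh noise at every step except the last
            x = x + torch.sqrt(beta[t]) * torch.randn_like(x)
    return x

# With a stand-in model that predicts zero noise, the loop still runs end to end
dummy = lambda x, t: torch.zeros_like(x)
points = sample(dummy, n_points=16)
```

A trained `SimpleDiffusionModel` would be passed in place of `dummy` to produce the checkerboard.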

Conclusion

This toy example demonstrates the fundamental mechanics that power state-of-the-art generative models. Whether you are generating 2D points or $1024 \times 1024$ images, the core principle remains the same: learn to reverse the destruction of information.

I hope this gives you a clearer intuition for how diffusion models operate!

If you found this useful, please check out the full reproducible code on my GitHub and give it a star if you feel so inclined.
