Pdbz199/simple-diffusion-model
From Noise to Structure: Building a Diffusion Model from Scratch

In my previous posts, we've explored how models like CLIP can encode semantic meaning in images and how contrastive learning can help us learn features without labels. Today, we're diving into one of the most exciting developments in generative AI: Diffusion Models.

Models like Stable Diffusion and DALL-E can seem like magic, turning random static into detailed art. But under the hood, they rely on a beautifully simple idea grounded in probability and physics.

In this post, we will demystify diffusion by building a minimal, reproducible example in PyTorch. Instead of generating complex faces or landscapes, we will tackle a 2D toy problem: learning a Checkerboard distribution. This allows us to visualize exactly what the model is doing at every single step.

The Goal: Order from Chaos

Our objective is to train a neural network that takes pure Gaussian noise (random points) and progressively "denoises" them until they form a structured checkerboard pattern.

Here is the transformation we want to achieve:

Noise vs Generated (Left: The real data distribution. Right: The model's output, reconstructed from pure noise.)

And here is the full process in action. Watch as the points, initially scattered in chaos, slowly drift into the structured manifold of the checkerboard:

Diffusion Process

How do we get here? Let's break down the math and the code.


The Math: Forward and Reverse Processes

A diffusion model consists of two processes:

  1. Forward Process (Diffusion): We gradually destroy structure by adding noise.
  2. Reverse Process (Generation): We train a model to restore structure by removing noise.

1. The Forward Process ($q$)

Given a data point $x_0$ from our real distribution, we add small amounts of Gaussian noise over $T$ timesteps to produce a sequence $x_1, \dots, x_T$. As $T \to \infty$, $x_T$ becomes pure isotropic Gaussian noise.

The transition is defined as: $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$

where $\beta_t$ is our noise schedule.

Crucially, we don't need to iterate through every step to train. Thanks to the properties of Gaussians, we can sample $x_t$ at any arbitrary timestep $t$ directly from $x_0$: $$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$$ where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$.
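This closed-form jump can be sketched in a few lines of PyTorch. The linear $\beta$ schedule here is an illustrative assumption (the repo may use a different schedule); the `q_sample` helper is just the equation above:

```python
import torch

n_steps = 100
# Linear beta schedule (an assumption for illustration)
beta = torch.linspace(1e-4, 0.02, n_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)  # cumulative product of (1 - beta_s)

def q_sample(x0, t, epsilon):
    """Jump straight to x_t from x_0 using the closed-form expression."""
    a_bar_t = alpha_bar[t].unsqueeze(1)  # shape (batch, 1) to broadcast over 2D points
    return torch.sqrt(a_bar_t) * x0 + torch.sqrt(1 - a_bar_t) * epsilon

x0 = torch.randn(8, 2)               # a batch of 2D points
t = torch.randint(0, n_steps, (8,))  # an independent random timestep per point
xt = q_sample(x0, t, torch.randn_like(x0))
```

Note that `alpha_bar` decreases monotonically toward zero, so larger $t$ means the sample is dominated by noise.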

2. The Reverse Process ($p_\theta$)

The magic happens in the reverse direction. If we could sample from $q(x_{t-1} | x_t)$, we could turn noise back into data. Since we can't calculate this directly, we approximate it with a neural network:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

In practice, following the DDPM (Denoising Diffusion Probabilistic Models) paper, we parameterize our model to predict the noise $\epsilon$ that was added to $x_0$.
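In this parameterization the mean of the reverse transition is fully determined by the predicted noise:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)$$

where $\alpha_t = 1 - \beta_t$, and $\Sigma_\theta$ is typically fixed (e.g. to $\beta_t I$) rather than learned.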

3. The Objective

We train our network $\epsilon_\theta(x_t, t)$ to minimize the Mean Squared Error (MSE) between the actual noise $\epsilon$ and the predicted noise:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

This is the beauty of it: we are simply training a function approximator to "guess the noise."


The Implementation

We'll build this using PyTorch. The full code is available in the repo, but here are the core components.

1. The Data

We generate 2D points from a checkerboard pattern. This gives us a clear ground truth to visually verify our results.

import numpy as np
import torch

def generate_checkerboard_data(n_samples=10_000):
    x1 = np.random.uniform(0, 4, n_samples)
    x2 = np.random.uniform(0, 4, n_samples)
    # Keep points whose cell coordinates sum to an even parity (the "black" squares)
    mask = (np.floor(x1) % 2 + np.floor(x2) % 2) % 2 == 0
    data = np.vstack([x1[mask], x2[mask]]).T
    return torch.tensor((data - 2) / 2, dtype=torch.float32)  # Normalize [0, 4] -> [-1, 1]
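As a quick sanity check of the mask logic (a standalone re-derivation, not code from the repo): a point is kept exactly when the parities of its cell coordinates sum to an even number, which traces out a checkerboard.

```python
import numpy as np

def on_board(x1, x2):
    # Same parity test as in generate_checkerboard_data
    return (np.floor(x1) % 2 + np.floor(x2) % 2) % 2 == 0

assert on_board(0.5, 0.5)      # cell (0, 0) is kept
assert not on_board(1.5, 0.5)  # cell (1, 0) is dropped
assert on_board(1.5, 1.5)      # cell (1, 1) is kept
```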

2. The Model

Since our data is 2D, we don't need a complex U-Net or Transformer. A simple multi-layer perceptron (MLP) with the timestep $t$ appended as an extra input feature works perfectly.

import torch
import torch.nn as nn

class SimpleDiffusionModel(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1, hidden_dim),  # Input: x, y, and time t
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),  # Output: predicted noise (epsilon)
        )

    def forward(self, x, t):
        # Condition on time by appending t as an extra feature
        x_t = torch.cat([x, t], dim=1)
        return self.net(x_t)

3. The Training Loop

The training loop directly implements the math we discussed:

  1. Sample real data $x_0$.
  2. Sample a random timestep $t$.
  3. Sample random noise $\epsilon$.
  4. Compute the noisy input $x_t$ using the closed-form expression from the forward process.
  5. Predict the noise and backpropagate the MSE loss.

# ... inside the training loop ...
for x0 in dataloader:
    t = torch.randint(0, n_steps, (x0.shape[0],)).long().to(device)
    epsilon = torch.randn_like(x0)

    # Create noisy sample x_t
    a_bar_t = alpha_bar[t].unsqueeze(1)
    xt = torch.sqrt(a_bar_t) * x0 + torch.sqrt(1 - a_bar_t) * epsilon

    # Predict the noise
    t_input = (t / n_steps).unsqueeze(1).float()
    predicted_epsilon = model(xt, t_input)

    # Optimize
    loss = nn.MSELoss()(predicted_epsilon, epsilon)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Running the Code

I've structured the code so you can train the model and visualize the results separately. This is great for tweaking the visualization without re-running the expensive training loop.

1. Train the Model

First, train the model for 5000 epochs (which takes ~10 minutes on a GPU) and save the checkpoint:

python main.py --mode train --epochs 5000 --checkpoint my_model.pth

Output:

Training with 100 steps for 5000 epochs...
Using device: cuda
Starting training...
Epoch 0, Average Loss: 0.637849
...
Epoch 4500, Average Loss: 0.512942
Model saved to my_model.pth

2. Visualize the Results

Then, load the checkpoint to generate the static plot (diffusion_checkerboard.png) and the animation (diffusion_process.gif):

python main.py --mode visualize --checkpoint my_model.pth

Output:

Loading model from my_model.pth...
Sampling...
Saved plot to diffusion_checkerboard.png
Saved animation to diffusion_process.gif

Why Does This Work?

It helps to think of the reverse process as iterative refinement.

At the first reverse step ($t = T = 100$, pure noise), the model effectively asks: "If this random point was generated by adding noise to a checkerboard, in which direction should I push it to make it look slightly less noisy?"

Because each $\beta_t$ in the noise schedule is small, the model only needs to make a tiny adjustment at each step. It doesn't need to jump straight to the solution; it just needs to nudge the point slightly toward the data manifold. Over 100 small nudges, points that started as pure randomness collectively migrate to the target distribution.
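This nudging is exactly what the sampling loop performs. Below is a minimal sketch of DDPM ancestral sampling, assuming the same linear $\beta$ schedule as the earlier snippet and an $\epsilon$-predicting model with the `forward(x, t)` signature shown above; the repo's actual sampler may differ in details such as the variance choice:

```python
import torch

n_steps = 100
# Assumed linear beta schedule (for illustration)
beta = torch.linspace(1e-4, 0.02, n_steps)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)

@torch.no_grad()
def sample(model, n_points=1000):
    """DDPM ancestral sampling: start from pure noise and denoise step by step."""
    x = torch.randn(n_points, 2)  # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        t_input = torch.full((n_points, 1), t / n_steps)
        eps = model(x, t_input)   # predicted noise at this step
        # Posterior mean: one small nudge toward the data manifold
        x = (x - beta[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha[t])
        if t > 0:                 # add fresh noise at every step except the last
            x = x + torch.sqrt(beta[t]) * torch.randn_like(x)
    return x

# With a stand-in model that predicts zero noise, the loop still runs end to end
dummy = lambda x, t: torch.zeros_like(x)
points = sample(dummy, n_points=16)
```

A trained `SimpleDiffusionModel` would be passed in place of `dummy` to produce the checkerboard.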

Conclusion

This toy example demonstrates the fundamental mechanics that power state-of-the-art generative models. Whether you are generating 2D points or $1024 \times 1024$ images, the core principle remains the same: learn to reverse the destruction of information.

I hope this gives you a clearer intuition for how diffusion models operate!

If you found this useful, please check out the full reproducible code on my GitHub and give it a star if you feel so inclined.
