
Diffusion Models From Scratch
Last Updated on April 15, 2025 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.

Remember when AI-generated images looked like abstract art someone made after three espressos? You'd type "astronaut riding a horse," and get back something vaguely astronaut-shaped, vaguely horse-shaped, mostly… blob-shaped. Yeah, those were the days.
Fast forward to now. We're living in an era of stunning AI art, from Midjourney's hyper-realistic scenes to Stable Diffusion's creative outputs. Type in a prompt, and poof, you get something that often looks like magic. A huge chunk of that "magic" comes down to a class of models that have absolutely exploded in popularity: Diffusion Models.
But how do they actually work? If you're like me, seeing "diffusion" might conjure up images of high school chemistry or maybe perfume spreading across a room. It turns out, the core idea isn't that far off.
Stick with me. Whether you're an ML engineer wondering if you should add these to your toolkit, a curious AI enthusiast, or just someone trying to understand the tech behind the cool pictures, we're going to break down diffusion models, not quite line-by-line coding "from scratch," but building the understanding from the ground up.
Setting the Stage: Why All the Fuss About Generative AI & Diffusion?
First off, what are we even talking about with "generative models"? Simply put, these are AI systems designed to create new data that resembles the data they were trained on. Think generating images, music, text, even code.
For years, the stars of the image generation show were GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders). GANs are like an art forger trying to fool a detective: they learn by competing. VAEs learn to compress data into a compact form and then reconstruct it. Both work, but GANs often struggle with training stability, and VAEs tend to produce slightly blurrier images.
Then, diffusion models arrived and started delivering state-of-the-art results, particularly in image quality and diversity. Models like DALL·E 2, Imagen, Flux, and Stable Diffusion blew people away, and they all heavily rely on diffusion principles. That's why everyone's talking about them.
How Diffusion Models Work: From Noise to Art (and Back Again)
Okay, the core concept. Imagine you have a beautiful, clear photograph. Now, imagine slowly adding tiny specks of random noise, step by step, until the original image is completely drowned out, leaving only static, like an old TV screen with no signal.
That's the Forward Process (or Diffusion Process). It's mathematically defined, straightforward, and involves gradually adding Gaussian noise over a series of time steps (let's say T steps). We know exactly how much noise we're adding at each step. Easy peasy.

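To make that concrete, here is a minimal PyTorch sketch of the forward process (the linear beta schedule, variable names, and toy image shapes are illustrative choices, not the only options). A handy property of Gaussian noising is that you can jump from a clean image x0 straight to its noisy version x_t at any step in one shot, rather than looping through every intermediate step.

```python
import torch

# Illustrative linear noise schedule over T steps (DDPM-style values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # how much noise is added at each step
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product: how much signal survives up to step t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (batch, channels, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

# Example: noise a batch of stand-in "images" to random timesteps.
x0 = torch.randn(4, 3, 32, 32)               # placeholder for real training images
t = torch.randint(0, T, (4,))
x_t, true_noise = q_sample(x0, t)
```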
Now, here's where the magic happens: What if we could reverse this?
What if we could train a model to look at a noisy image and predict just the noise that was added at a particular step? If our model can accurately predict the noise, we can subtract it, taking a small step back towards the original, cleaner image.
That's the Reverse Process. We start with pure random noise (like the static TV screen) and feed it into our trained model. The model predicts the noise present in the input. We subtract this predicted noise (or a version of it) and repeat this process for T steps. Each step denoises the image slightly, gradually revealing a coherent image that looks like it could have come from the original training data.

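Here is a hedged sketch of that reverse loop, reusing the schedule from the forward-process snippet above. `model(x_t, t)` stands in for any trained noise-prediction network (a U-Net in practice); the update is the standard DDPM rule of subtracting the scaled predicted noise and re-injecting a little fresh noise on all but the final step.

```python
@torch.no_grad()
def sample(model, shape, device="cpu"):
    """Start from pure Gaussian noise and denoise for T steps (DDPM ancestral sampling)."""
    x = torch.randn(shape, device=device)                  # the "static TV screen"
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)                        # predicted noise at this step
        beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alpha_bars[t]
        # Remove the predicted noise: posterior mean of x_{t-1} given x_t.
        x = (x - beta_t / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()
        if t > 0:
            x = x + beta_t.sqrt() * torch.randn_like(x)    # add back a little noise, except at the end
    return x
```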
Think of it like sculpting.
- Forward Process: Imagine a finished statue slowly dissolving into a block of marble dust.
- Reverse Process: You start with the block of marble dust (noise) and carefully remove bits (predicted noise) step-by-step, revealing the statue hidden within.
The "model" doing this heavy lifting in the reverse process is typically a sophisticated neural network architecture like a U-Net (commonly used in image segmentation, adapted here for noise prediction). It takes the noisy image and the current time step as input and outputs the predicted noise.
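The training objective behind that network is surprisingly simple: take a real image, pick a random timestep, noise it with the forward process, and penalize the network for mispredicting the noise. A minimal sketch of one training step, assuming a `unet(x_t, t)` callable that returns a tensor shaped like its input, and reusing `q_sample` and `T` from the earlier snippet:

```python
import torch
import torch.nn.functional as F

def training_step(unet, x0, optimizer):
    """One DDPM-style training step: predict the injected noise, minimize MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per image
    x_t, true_noise = q_sample(x0, t)                           # forward process from the sketch above
    eps_hat = unet(x_t, t)                                      # network's guess at the noise
    loss = F.mse_loss(eps_hat, true_noise)                      # the "simple" noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```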
"From Scratch" Thinking: What Do We Mean?
When we say "from scratch" here, we're focusing on understanding the mechanisms and concepts from their foundations. Are we going to code a whole diffusion model in this article? Nope. That's a complex engineering task involving intricate network architectures (like U-Nets with attention mechanisms), careful noise scheduling (how much noise to add/remove at each step), and usually hefty compute resources for training.
But "from scratch" thinking means grasping:
- The two core processes: Forward (adding noise) and Reverse (learning to remove it).
- The goal: Train a model to predict the noise added at any given step.
- The input/output: Start with noise, iteratively denoise using the modelβs predictions.
Understanding this conceptually is the crucial first step before diving into libraries like Hugging Face's diffusers or PyTorch implementations. You need to know what the code is trying to achieve.
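If you want to see the whole pipeline run without writing any of this yourself, the diffusers library wraps the sampling loop behind a few lines of code. A rough sketch (the model ID, precision, and step count here are illustrative; check the library docs for current options):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion pipeline (model ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Under the hood, this runs the iterative denoising loop described above.
image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")
```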
How Do Diffusion Models Stack Up Against the Others?

Let's quickly compare:
- vs. GANs: Diffusion models generally produce higher-quality, more diverse images and are more stable to train (no tricky adversarial balance). However, GANs are typically much faster at generating images once trained. Diffusion requires multiple iterative steps (though progress is being made here!). Think of GANs as a high-stakes sprint (forger vs. detective), while diffusion is more like a meticulous marathon (sculpting noise).
- vs. VAEs: VAEs are great for learning meaningful compressed representations (latent spaces) and are usually faster than diffusion. Diffusion models often achieve better raw generation quality, avoiding the slight blurriness sometimes seen in VAE outputs.
- vs. Autoregressive Models (e.g., PixelCNN): These models generate images pixel by pixel, which can be very slow. Diffusion models generate the whole image iteratively and generally produce more globally coherent results.
Diffusion models offer a compelling trade-off: excellent sample quality and training stability, at the cost of slower sampling speed (though faster samplers like DDIM and model distillation techniques are closing this gap).
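As a concrete example of how that speed gap is being closed, you can typically swap the sampler on a pretrained pipeline for DDIM and run far fewer denoising steps. A hedged sketch, reusing the pipeline from the snippet above (the step count is illustrative):

```python
from diffusers import DDIMScheduler

# Swap the default sampler for DDIM and sample in ~20 steps instead of
# the hundreds/thousands used by vanilla DDPM-style ancestral sampling.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = pipe("an astronaut riding a horse", num_inference_steps=20).images[0]
```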
Why Should You Care? The Real-World Impact
Okay, cool tech, but why does it matter to an ML engineer, developer, or creator?
- State-of-the-Art Generation: For tasks requiring high-fidelity image or even audio generation, diffusion models are often the top choice right now.
- Beyond Images: While famous for images, diffusion principles are being applied to audio generation (like WaveGrad), video, and 3D shape generation. There's even growing research into using them for text generation, although NLP is still largely dominated by Transformers.
- Controllability: Techniques are rapidly evolving to allow fine-grained control over diffusion outputs using text prompts (like Stable Diffusion), image inputs (img2img), or other conditioning information. This opens up huge creative and practical possibilities (see the image-to-image sketch after this list).
- Understanding the Frontier: Knowing how these models work helps you understand the capabilities and limitations of modern generative AI, whether you're building with it, using it, or just evaluating its impact.
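As an example of that controllability, image-to-image generation starts the reverse process from a partially noised version of an existing picture instead of pure static. A rough sketch using diffusers (the model ID, input file, and strength value are illustrative):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Image-to-image: condition generation on an existing picture.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")   # placeholder starting image
image = pipe(
    prompt="a watercolor painting of a lighthouse",
    image=init_image,
    strength=0.6,   # closer to 0 keeps the original, closer to 1 re-imagines it
).images[0]
image.save("lighthouse.png")
```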
The Catch: Challenges and Common Pitfalls

It's not all sunshine and perfectly generated cats.
- Slow Sampling: The iterative denoising process (often hundreds or thousands of steps) makes generation slower than single-pass models like GANs. This is a major area of research.
- Computationally Intensive Training: Training these models requires significant compute resources and large datasets.
- Understanding vs. Implementation: Grasping the concept is one thing; implementing an efficient and effective diffusion model requires careful engineering (network architecture, noise schedules, conditioning mechanisms).
- "Magic" Misconception: They aren't magic. They learn patterns from data. Biases in the training data will be reflected (and sometimes amplified) in the generated outputs.
Wrapping Up: The Noise Is the Signal
So, what have we learned?
- Diffusion Models generate data by reversing a process of gradually adding noise.
- They start with pure noise and use a trained model to iteratively predict and remove noise, step-by-step, until a clean sample emerges.
- The core idea involves a fixed Forward Process (adding noise) and a learned Reverse Process (removing noise).
- They offer state-of-the-art quality for generation tasks, especially images, outperforming older methods like GANs and VAEs in many benchmarks.
- Key advantages include high sample quality and stable training.
- The main drawback is typically slower sampling speed compared to models like GANs, although this is improving.
- "From Scratch" understanding involves grasping these core mechanics, not necessarily coding the whole thing immediately.
Diffusion models represent a powerful and elegant approach to generative modeling. By learning to controllably reverse the process of destruction (adding noise), they learn to create. It's a beautiful concept, transforming random static into coherent, often stunning, outputs.
Dive Deeper: Further Resources
Ready to go beyond the conceptual? Here are some great starting points:
Papers:
- Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (The foundational modern paper): https://arxiv.org/abs/2006.11239
- Denoising Diffusion Implicit Models (DDIM) by Song et al. (Faster sampling): https://arxiv.org/abs/2010.02502
- High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion paper): https://arxiv.org/abs/2112.10752
Blog Posts/Tutorials:
- Lilian Weng's Blog: "What are Diffusion Models?" (Excellent technical overview): https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- AssemblyAI Blog: "Diffusion Models for Beginners" (Good visual explanations): https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/
Code:
- Hugging Face Diffusers Library (The easiest way to try them out!): https://github.com/huggingface/diffusers
- Phil Wang (lucidrains) PyTorch Implementations (Often minimal, great for learning): Search GitHub for lucidrains diffusion
Hopefully, this gives you a solid foundation for understanding diffusion models. They're a fascinating area of research and a powerful tool in the generative AI landscape. Go forth and denoise!