
Diffusion Models From Scratch
Last Updated on April 15, 2025 by Editorial Team
Author(s): Barhoumi Mosbeh
Originally published on Towards AI.

Remember when AI-generated images looked like abstract art someone made after three espressos? You'd type "astronaut riding a horse," and get back something vaguely astronaut-shaped, vaguely horse-shaped, mostly… blob-shaped. Yeah, those were the days.
Fast forward to now. We're living in an era of stunning AI art, from Midjourney's hyper-realistic scenes to Stable Diffusion's creative outputs. Type in a prompt, and poof, you get something that often looks like magic. A huge chunk of that "magic" comes down to a class of models that have absolutely exploded in popularity: Diffusion Models.
But how do they actually work? If you're like me, seeing "diffusion" might conjure up images of high school chemistry or maybe perfume spreading across a room. It turns out, the core idea isn't that far off.
Stick with me. Whether you're an ML engineer wondering if you should add these to your toolkit, a curious AI enthusiast, or just someone trying to understand the tech behind the cool pictures, we're going to break down diffusion models, not quite line-by-line coding "from scratch," but building the understanding from the ground up.
Setting the Stage: Why All the Fuss About Generative AI & Diffusion?
First off, what are we even talking about with "generative models"? Simply put, these are AI systems designed to create new data that resembles the data they were trained on. Think generating images, music, text, even code.
For years, the stars of the image generation show were GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders). GANs are like an art forger trying to fool a detective: they learn by competing. VAEs learn to compress data into a compact form and then reconstruct it. Both work, but GANs often struggle with training stability, and VAEs tend to produce slightly blurrier images.
Then, diffusion models arrived and started delivering state-of-the-art results, particularly in image quality and diversity. Models like DALL·E 2, Imagen, Flux, and Stable Diffusion blew people away, and they all heavily rely on diffusion principles. That's why everyone's talking about them.
How Diffusion Models Work: From Noise to Art (and Back Again)
Okay, the core concept. Imagine you have a beautiful, clear photograph. Now, imagine slowly adding tiny specks of random noise, step by step, until the original image is completely drowned out, leaving only static, like an old TV screen with no signal.
That's the Forward Process (or Diffusion Process). It's mathematically defined, straightforward, and involves gradually adding Gaussian noise over a series of time steps (let's say T steps). We know exactly how much noise we're adding at each step. Easy peasy.

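To make that concrete, here is a minimal PyTorch sketch of the forward process (the linear beta schedule, variable names, and toy image shapes are illustrative choices, not the only options). A handy property of Gaussian noising is that you can jump from a clean image x0 straight to its noisy version x_t at any step in one shot, rather than looping through every intermediate step.

```python
import torch

# Illustrative linear noise schedule over T steps (DDPM-style values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # how much noise is added at each step
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product: how much signal survives up to step t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (batch, channels, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

# Example: noise a batch of stand-in "images" to random timesteps.
x0 = torch.randn(4, 3, 32, 32)               # placeholder for real training images
t = torch.randint(0, T, (4,))
x_t, true_noise = q_sample(x0, t)
```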
Now, here's where the magic happens: What if we could reverse this?
What if we could train a model to look at a noisy image and predict just the noise that was added at a particular step? If our model can accurately predict the noise, we can subtract it, taking a small step back towards the original, cleaner image.
That's the Reverse Process. We start with pure random noise (like the static TV screen) and feed it into our trained model. The model predicts the noise present in the input. We subtract this predicted noise (or a version of it) and repeat this process for T steps. Each step denoises the image slightly, gradually revealing a coherent image that looks like it could have come from the original training data.

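Here is a hedged sketch of that reverse loop, reusing the schedule from the forward-process snippet above. `model(x_t, t)` stands in for any trained noise-prediction network (a U-Net in practice); the update is the standard DDPM rule of subtracting the scaled predicted noise and re-injecting a little fresh noise on all but the final step.

```python
@torch.no_grad()
def sample(model, shape, device="cpu"):
    """Start from pure Gaussian noise and denoise for T steps (DDPM ancestral sampling)."""
    x = torch.randn(shape, device=device)                  # the "static TV screen"
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)                        # predicted noise at this step
        beta_t, alpha_t, a_bar_t = betas[t], alphas[t], alpha_bars[t]
        # Remove the predicted noise: posterior mean of x_{t-1} given x_t.
        x = (x - beta_t / (1.0 - a_bar_t).sqrt() * eps_hat) / alpha_t.sqrt()
        if t > 0:
            x = x + beta_t.sqrt() * torch.randn_like(x)    # add back a little noise, except at the end
    return x
```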
Think of it like sculpting.
- Forward Process: Imagine a finished statue slowly dissolving into a block of marble dust.
- Reverse Process: You start with the block of marble dust (noise) and carefully remove bits (predicted noise) step-by-step, revealing the statue hidden within.
The "model" doing this heavy lifting in the reverse process is typically a sophisticated neural network architecture like a U-Net (commonly used in image segmentation, adapted here for noise prediction). It takes the noisy image and the current time step as input and outputs the predicted noise.
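The training objective behind that network is surprisingly simple: take a real image, pick a random timestep, noise it with the forward process, and penalize the network for mispredicting the noise. A minimal sketch of one training step, assuming a `unet(x_t, t)` callable that returns a tensor shaped like its input, and reusing `q_sample` and `T` from the earlier snippet:

```python
import torch
import torch.nn.functional as F

def training_step(unet, x0, optimizer):
    """One DDPM-style training step: predict the injected noise, minimize MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per image
    x_t, true_noise = q_sample(x0, t)                           # forward process from the sketch above
    eps_hat = unet(x_t, t)                                      # network's guess at the noise
    loss = F.mse_loss(eps_hat, true_noise)                      # the "simple" noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```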
"From Scratch" Thinking: What Do We Mean?
When we say "from scratch" here, we're focusing on understanding the mechanisms and concepts from their foundations. Are we going to code a whole diffusion model in this article? Nope. That's a complex engineering task involving intricate network architectures (like U-Nets with attention mechanisms), careful noise scheduling (how much noise to add/remove at each step), and usually hefty compute resources for training.
But "from scratch" thinking means grasping:
- The two core processes: Forward (adding noise) and Reverse (learning to remove it).
- The goal: Train a model to predict the noise added at any given step.
- The input/output: Start with noise, iteratively denoise using the modelβs predictions.
Understanding this conceptually is the crucial first step before diving into libraries like Hugging Face's diffusers or PyTorch implementations. You need to know what the code is trying to achieve.
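If you want to see the whole pipeline run without writing any of this yourself, the diffusers library wraps the sampling loop behind a few lines of code. A rough sketch (the model ID, precision, and step count here are illustrative; check the library docs for current options):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained text-to-image diffusion pipeline (model ID is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Under the hood, this runs the iterative denoising loop described above.
image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut.png")
```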
How Do Diffusion Models Stack Up Against the Others?

Let's quickly compare:
- vs. GANs: Diffusion models generally produce higher-quality, more diverse images and are more stable to train (no tricky adversarial balance). However, GANs are typically much faster at generating images once trained. Diffusion requires multiple iterative steps (though progress is being made here!). Think of GANs as a high-stakes sprint (forger vs. detective), while diffusion is more like a meticulous marathon (sculpting noise).
- vs. VAEs: VAEs are great for learning meaningful compressed representations (latent spaces) and are usually faster than diffusion. Diffusion models often achieve better raw generation quality, avoiding the slight blurriness sometimes seen in VAE outputs.
- vs. Autoregressive Models (e.g., PixelCNN): These models generate images pixel by pixel, which can be very slow. Diffusion models generate the whole image iteratively and generally produce more globally coherent results.
Diffusion models offer a compelling trade-off: excellent sample quality and training stability, at the cost of slower sampling speed (though faster samplers like DDIM and model distillation techniques are closing this gap).
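As a concrete example of how that speed gap is being closed, you can typically swap the sampler on a pretrained pipeline for DDIM and run far fewer denoising steps. A hedged sketch, reusing the pipeline from the snippet above (the step count is illustrative):

```python
from diffusers import DDIMScheduler

# Swap the default sampler for DDIM and sample in ~20 steps instead of
# the hundreds/thousands used by vanilla DDPM-style ancestral sampling.
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
image = pipe("an astronaut riding a horse", num_inference_steps=20).images[0]
```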
Why Should You Care? The Real-World Impact
Okay, cool tech, but why does it matter to an ML engineer, developer, or creator?
- State-of-the-Art Generation: For tasks requiring high-fidelity image or even audio generation, diffusion models are often the top choice right now.
- Beyond Images: While famous for images, diffusion principles are being applied to audio generation (like WaveGrad), video, and 3D shape generation. There's even growing research into using them for text generation, although NLP is still largely dominated by Transformers.
- Controllability: Techniques are rapidly evolving to allow fine-grained control over diffusion outputs using text prompts (like Stable Diffusion), image inputs (img2img), or other conditioning information. This opens up huge creative and practical possibilities (see the image-to-image sketch after this list).
- Understanding the Frontier: Knowing how these models work helps you understand the capabilities and limitations of modern generative AI, whether you're building with it, using it, or just evaluating its impact.
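As an example of that controllability, image-to-image generation starts the reverse process from a partially noised version of an existing picture instead of pure static. A rough sketch using diffusers (the model ID, input file, and strength value are illustrative):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# Image-to-image: condition generation on an existing picture.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.png").convert("RGB")   # placeholder starting image
image = pipe(
    prompt="a watercolor painting of a lighthouse",
    image=init_image,
    strength=0.6,   # closer to 0 keeps the original, closer to 1 re-imagines it
).images[0]
image.save("lighthouse.png")
```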
The Catch: Challenges and Common Pitfalls

It's not all sunshine and perfectly generated cats.
- Slow Sampling: The iterative denoising process (often hundreds or thousands of steps) makes generation slower than single-pass models like GANs. This is a major area of research.
- Computationally Intensive Training: Training these models requires significant compute resources and large datasets.
- Understanding vs. Implementation: Grasping the concept is one thing; implementing an efficient and effective diffusion model requires careful engineering (network architecture, noise schedules, conditioning mechanisms).
- "Magic" Misconception: They aren't magic. They learn patterns from data. Biases in the training data will be reflected (and sometimes amplified) in the generated outputs.
Wrapping Up: The Noise Is the Signal
So, what have we learned?
- Diffusion Models generate data by reversing a process of gradually adding noise.
- They start with pure noise and use a trained model to iteratively predict and remove noise, step-by-step, until a clean sample emerges.
- The core idea involves a fixed Forward Process (adding noise) and a learned Reverse Process (removing noise).
- They offer state-of-the-art quality for generation tasks, especially images, outperforming older methods like GANs and VAEs in many benchmarks.
- Key advantages include high sample quality and stable training.
- The main drawback is typically slower sampling speed compared to models like GANs, although this is improving.
- "From Scratch" understanding involves grasping these core mechanics, not necessarily coding the whole thing immediately.
Diffusion models represent a powerful and elegant approach to generative modeling. By learning to controllably reverse the process of destruction (adding noise), they learn to create. It's a beautiful concept, transforming random static into coherent, often stunning, outputs.
Dive Deeper: Further Resources
Ready to go beyond the conceptual? Here are some great starting points:
Papers:
- Denoising Diffusion Probabilistic Models (DDPM) by Ho et al. (The foundational modern paper): https://arxiv.org/abs/2006.11239
- Denoising Diffusion Implicit Models (DDIM) by Song et al. (Faster sampling): https://arxiv.org/abs/2010.02502
- High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion paper): https://arxiv.org/abs/2112.10752
Blog Posts/Tutorials:
- Lilian Weng's Blog: "What are Diffusion Models?" (Excellent technical overview): https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- AssemblyAI Blog: "Diffusion Models for Beginners" (Good visual explanations): https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/
Code:
- Hugging Face Diffusers Library (The easiest way to try them out!): https://github.com/huggingface/diffusers
- Phil Wang (lucidrains) PyTorch Implementations (Often minimal, great for learning): Search GitHub for lucidrains diffusion
Hopefully, this gives you a solid foundation for understanding diffusion models. They're a fascinating area of research and a powerful tool in the generative AI landscape. Go forth and denoise!