

Diffusion Models From Scratch

Last Updated on April 15, 2025 by Editorial Team

Author(s): Barhoumi Mosbeh

Originally published on Towards AI.

[AI-generated image]

Remember when AI-generated images looked like abstract art someone made after three espressos? You’d type β€œastronaut riding a horse,” and get back something vaguely astronaut-shaped, vaguely horse-shaped, mostly… blob-shaped. Yeah, those were the days.

Fast forward to now. We’re living in an era of stunning AI art, from Midjourney’s hyper-realistic scenes to Stable Diffusion’s creative outputs. Type in a prompt, and poof, you get something that often looks like magic. A huge chunk of that β€œmagic” comes down to a class of models that have absolutely exploded in popularity: Diffusion Models.

But how do they actually work? If you’re like me, seeing β€œdiffusion” might conjure up images of high school chemistry or maybe perfume spreading across a room. It turns out, the core idea isn’t that far off.

Stick with me. Whether you’re an ML engineer wondering if you should add these to your toolkit, a curious AI enthusiast, or just someone trying to understand the tech behind the cool pictures, we’re going to break down diffusion models, not quite line-by-line coding β€œfrom scratch,” but building the understanding from the ground up.

Setting the Stage: Why All the Fuss About Generative AI & Diffusion?

First off, what are we even talking about with β€œgenerative models”? Simply put, these are AI systems designed to create new data that resembles the data they were trained on. Think generating images, music, text, even code.

For years, the stars of the image generation show were GANs (Generative Adversarial Networks) and VAEs (Variational Autoencoders). GANs are like an art forger trying to fool a detective: they learn by competing. VAEs learn to compress data into a compact form and then reconstruct it. Both work, but GANs often struggle with training stability, and VAEs tend to produce slightly blurry images.

Then, diffusion models arrived and started delivering state-of-the-art results, particularly in image quality and diversity. Models like DALL·E, Imagen, Flux, and Stable Diffusion blew people away, and they all rely heavily on diffusion principles. That’s why everyone’s talking about them.

How Diffusion Models Work: From Noise to Art (and Back Again)

Okay, the core concept. Imagine you have a beautiful, clear photograph. Now, imagine slowly adding tiny specks of random noise, step by step, until the original image is completely drowned out, leaving only static, like an old TV screen with no signal.

That’s the Forward Process (or Diffusion Process). It’s mathematically defined, straightforward, and involves gradually adding Gaussian noise over a series of time steps (let’s say T steps). We know exactly how much noise we’re adding at each step. Easy peasy.
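
To make that concrete, here’s a minimal PyTorch sketch of the forward process. The linear beta schedule, the `q_sample` name, and T = 1000 are illustrative choices following the common DDPM setup, not anything mandated:

```python
import torch

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # per-step noise amount (linear schedule)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product abar_t

def q_sample(x0, t, noise=None):
    """Jump straight to step t: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    if noise is None:
        noise = torch.randn_like(x0)
    ab = alpha_bars[t]
    while ab.dim() < x0.dim():              # broadcast over (B, C, H, W)
        ab = ab.unsqueeze(-1)
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

# Example: noise a dummy batch of one 3x32x32 "image" to step 500
x0 = torch.rand(1, 3, 32, 32) * 2 - 1       # pretend image, scaled to [-1, 1]
x500 = q_sample(x0, t=500)
```

The handy closed-form trick here is that you can jump directly to any step t in one shot, without simulating all the intermediate noising steps.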

[AI-generated image]

Now, here’s where the magic happens: What if we could reverse this?

What if we could train a model to look at a noisy image and predict just the noise that was added at a particular step? If our model can accurately predict the noise, we can subtract it, taking a small step back towards the original, cleaner image.

That’s the Reverse Process. We start with pure random noise (like the static TV screen) and feed it into our trained model. The model predicts the noise present in the input. We subtract this predicted noise (or a version of it) and repeat this process for T steps. Each step denoises the image slightly, gradually revealing a coherent image that looks like it could have come from the original training data.
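
Here’s that loop as a sketch, assuming a `model(x, t)` that returns the predicted noise and reusing `T`, `betas`, `alphas`, and `alpha_bars` from the forward-process snippet above. This follows the standard DDPM sampler; real implementations offer several variants:

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape):
    """DDPM-style sampling: start from pure noise and denoise for T steps."""
    x = torch.randn(shape)                        # the "static TV screen"
    for t in reversed(range(T)):
        eps_hat = model(x, torch.tensor([t]))     # predicted noise at step t
        coef = betas[t] / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()
        if t > 0:
            # add back a little fresh noise, except at the very last step
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x

# Usage (with a trained model): sample = p_sample_loop(model, (1, 3, 32, 32))
```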

[AI-generated image]

Think of it like sculpting.

  • Forward Process: Imagine a finished statue slowly dissolving into a block of marble dust.
  • Reverse Process: You start with the block of marble dust (noise) and carefully remove bits (predicted noise) step-by-step, revealing the statue hidden within.

The β€œmodel” doing this heavy lifting in the reverse process is typically a sophisticated neural network architecture like a U-Net (commonly used in image segmentation, adapted here for noise prediction). It takes the noisy image and the current time step as input and outputs the predicted noise.

β€œFrom Scratch” Thinking: What Do We Mean?

When we say β€œfrom scratch” here, we’re focusing on understanding the mechanisms and concepts from their foundations. Are we going to code a whole diffusion model in this article? Nope. That’s a complex engineering task involving intricate network architectures (like U-Nets with attention mechanisms), careful noise scheduling (how much noise to add/remove at each step), and usually hefty compute resources for training.

But β€œfrom scratch” thinking means grasping:

  1. The two core processes: Forward (adding noise) and Reverse (learning to remove it).
  2. The goal: Train a model to predict the noise added at any given step.
  3. The input/output: Start with noise, iteratively denoise using the model’s predictions.

Understanding this conceptually is the crucial first step before diving into libraries like Hugging Face’s diffusers or PyTorch implementations. You need to know what the code is trying to achieve.
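
As a bridge to those libraries, here’s roughly what a single training step looks like, reusing `q_sample` and the schedule from the forward-process sketch; `model` stands for any network mapping `(x_t, t)` to a predicted-noise tensor:

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, optimizer):
    """One step: pick random timesteps, noise the batch, and regress
    the model's output onto the noise that was actually added."""
    t = torch.randint(0, T, (x0.shape[0],))    # a random timestep per image
    noise = torch.randn_like(x0)               # the epsilon we want predicted
    xt = q_sample(x0, t, noise)                # noisy batch (forward process)
    loss = F.mse_loss(model(xt, t), noise)     # simple MSE on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```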

How Do Diffusion Models Stack Up Against the Others?

[AI-generated image]

Let’s quickly compare:

  • vs. GANs: Diffusion models generally produce higher-quality, more diverse images and are more stable to train (no tricky adversarial balance). However, GANs are typically much faster at generating images once trained. Diffusion requires multiple iterative steps (though progress is being made here!). Think of GANs as a high-stakes sprint (forger vs. detective), while diffusion is more like a meticulous marathon (sculpting noise).
  • vs. VAEs: VAEs are great for learning meaningful compressed representations (latent spaces) and are usually faster than diffusion. Diffusion models often achieve better raw generation quality, avoiding the slight blurriness sometimes seen in VAE outputs.
  • vs. Autoregressive Models (e.g., PixelCNN): These models generate images pixel by pixel, which can be very slow. Diffusion models generate the whole image iteratively and generally produce more globally coherent results.

Diffusion models offer a compelling trade-off: excellent sample quality and training stability, at the cost of slower sampling speed (though techniques like DDIM sampling and model distillation are closing this gap).
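
For a taste of how that gap-closing looks in practice, here’s a sketch that swaps in diffusers’ DDIM sampler to generate in around 50 steps. The pipeline and scheduler classes are real; the checkpoint id is just one public option, and a GPU is assumed for reasonable speed:

```python
import torch
from diffusers import DDIMScheduler, StableDiffusionPipeline

# Load a public text-to-image checkpoint, then swap in the DDIM sampler
# so generation takes ~50 denoising steps instead of the default.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe("an astronaut riding a horse", num_inference_steps=50).images[0]
image.save("astronaut_ddim.png")
```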

Why Should You Care? The Real-World Impact

Okay, cool tech, but why does it matter to an ML engineer, developer, or creator?

  1. State-of-the-Art Generation: For tasks requiring high-fidelity image or even audio generation, diffusion models are often the top choice right now.
  2. Beyond Images: While famous for images, diffusion principles are being applied to audio generation (like WaveGrad), video, and 3D shape generation. There’s even growing research into using them for text generation, although NLP is still largely dominated by Transformers.
  3. Controllability: Techniques are rapidly evolving to allow fine-grained control over diffusion outputs using text prompts (like Stable Diffusion), image inputs (img2img), or other conditioning information (see the sketch after this list). This opens up huge creative and practical possibilities.
  4. Understanding the Frontier: Knowing how these models work helps you understand the capabilities and limitations of modern generative AI, whether you’re building with it, using it, or just evaluating its impact.
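
As an illustration of conditioning on an input image, here’s an img2img sketch with diffusers. The pipeline class is real; `sketch.png` is a hypothetical input file, and the checkpoint id is one public option:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init = Image.open("sketch.png").convert("RGB").resize((768, 768))
# strength controls how much of the init image's structure survives:
# near 0.0 keeps the input almost unchanged, near 1.0 ignores it.
out = pipe("a watercolor landscape", image=init, strength=0.6).images[0]
out.save("watercolor.png")
```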

The Catch: Challenges and Common Pitfalls

[AI-generated image]

It’s not all sunshine and perfectly generated cats.

  • Slow Sampling: The iterative denoising process (often hundreds or thousands of steps) makes generation slower than single-pass models like GANs. This is a major area of research.
  • Computationally Intensive Training: Training these models requires significant compute resources and large datasets.
  • Understanding vs. Implementation: Grasping the concept is one thing; implementing an efficient and effective diffusion model requires careful engineering (network architecture, noise schedules, conditioning mechanisms).
  • β€œMagic” Misconception: They aren’t magic. They learn patterns from data. Biases in the training data will be reflected (and sometimes amplified) in the generated outputs.

Wrapping Up: The Noise Is the Signal

So, what have we learned?

  • Diffusion Models generate data by reversing a process of gradually adding noise.
  • They start with pure noise and use a trained model to iteratively predict and remove noise, step-by-step, until a clean sample emerges.
  • The core idea involves a fixed Forward Process (adding noise) and a learned Reverse Process (removing noise).
  • They offer state-of-the-art quality for generation tasks, especially images, outperforming older methods like GANs and VAEs in many benchmarks.
  • Key advantages include high sample quality and stable training.
  • The main drawback is typically slower sampling speed compared to models like GANs, although this is improving.
  • β€œFrom Scratch” understanding involves grasping these core mechanics, not necessarily coding the whole thing immediately.

Diffusion models represent a powerful and elegant approach to generative modeling. By learning to controllably reverse the process of destruction (adding noise), they learn to create. It’s a beautiful concept, transforming random static into coherent, often stunning, outputs.

Dive Deeper: Further Resources

Ready to go beyond the conceptual? Here are some great starting points:

Papers:

  • Denoising Diffusion Probabilistic Models (Ho et al., 2020): https://arxiv.org/abs/2006.11239
  • Denoising Diffusion Implicit Models (Song et al., 2020): https://arxiv.org/abs/2010.02502
  • Score-Based Generative Modeling through Stochastic Differential Equations (Song et al., 2021): https://arxiv.org/abs/2011.13456

Code:

  • Hugging Face Diffusers Library (The easiest way to try them out!): https://github.com/huggingface/diffusers
  • Phil Wang (lucidrains) PyTorch implementations (often minimal, great for learning): https://github.com/lucidrains/denoising-diffusion-pytorch

Hopefully, this gives you a solid foundation for understanding diffusion models. They’re a fascinating area of research and a powerful tool in the generative AI landscape. Go forth and denoise!


Published via Towards AI
