[AI/ML] Diffusion Models — A Beginner’s Guide to the Math Behind Stable Diffusion and DALL-E!
Author(s): Shashwat Gupta
Originally published on Towards AI.
Generative modeling in computer vision has advanced significantly, with Diffusion Models leading the way and powering tools like DALL-E and Stable Diffusion. These models have transformed machine capabilities, offering new possibilities in art, design, and content creation. This blog introduces Diffusion Models, focusing on their intuition and mathematics for understanding related research. While much existing literature explains these models from a Markov Chain perspective, alternative perspectives and conditioning methods during generation remain underexplored.
This blog explores the mathematics behind two key perspectives of Diffusion Models —
1. Markov Chain Perspective
2. Langevin Dynamics Perspective (Noise-conditioned Score Generation)
while also explaining their architecture, conditioning mechanisms, and popular modifications.
Introduction:
There are several types of generative models popular now (as shown in Figure 2), but none is without its flaws:
- Generative Adversarial Networks (GANs): suffer from unstable training and limited diversity (mode collapse).
- Variational Autoencoders (VAEs) [8,9,23]: rely on a surrogate loss.
- Flow-based models: need specialized architectures to construct reversible transforms.
- Diffusion Models: inspired by non-equilibrium thermodynamics. Despite being slow at sampling, diffusion models outperform the other generative models and, in particular, are free from the issues listed above. Below, we explain the common perspectives used to understand diffusion models, specifically the ones needed to follow the architectures discussed later.
1. Markov Chain Perspective:
We touch upon the necessary mathematical details of diffusion models without diving deeply into the proofs (a more detailed treatment can be found in [1,2,3,4]). Our approach mostly follows the Denoising Diffusion Probabilistic Model (DDPM) [5,6,7], with some improvements suggested in later OpenAI papers [11,12].
Diffusion Models are latent-variable models that add noise to a sample through a Markov chain and then denoise the noisy image with a neural network. During training, noise is added according to a variance schedule, and a model is trained to denoise the image over multiple steps. During inference, denoising is applied to an isotropic Gaussian noise sample. Noising and denoising in many small steps, as opposed to a single step as in GANs, leads to more tractable computations [10].
The forward process is defined as follows:
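In standard DDPM notation [5], each forward step adds Gaussian noise whose variance βₜ is taken from the schedule:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})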
As t → ∞, xₜ approaches an isotropic Gaussian. For the forward process, xₜ can be computed in closed form from x₀ by using a reparametrization trick involving the sum of two Gaussians.
Defining two new variables:
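In the usual DDPM notation these are

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,

with which the forward process admits the closed form

q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).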
Since βₜ is small, q(xₜ₋₁|xₜ) is also Gaussian. However, estimating this quantity would require using the entire dataset, so we learn a model p_θ to approximate the conditional probabilities.
We run the reverse diffusion process:
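In DDPM form this is

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),

where p(x_T) = \mathcal{N}(0, I); in the original paper only the mean µ_θ is learned and the variance is fixed by the schedule.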
The forward and reverse processes are shown pictorially below (Figures 3 and 4):
We use the negative log-likelihood − log p_θ(x₀) as the loss. Similar to VAEs, we use the variational lower bound (VLB) to upper-bound this objective [8,9]. After simplification, additionally conditioning on x₀ (for better sampling), and ignoring pure q(xₜ) terms (since they carry no learnable parameters), we arrive at the following objective:
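Following the usual derivation, the bound decomposes into KL divergences between Gaussians (each computable in closed form) plus a reconstruction term:

L_{\mathrm{VLB}} = \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\Vert\, p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}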
Observing that βₜ is fixed by the schedule, we can minimize the MSE between µ̃ₜ(xₜ, x₀) and µ_θ(xₜ, t) as the objective. After simplification, this reduces to the MSE between the noise actually added at time t and the noise predicted by the model at time t; Ho et al. found that dropping the time-dependent weighting of this MSE improves sample quality.
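In the form used by Ho et al. [5], the resulting simplified training objective is a plain noise-prediction MSE:

L_{\mathrm{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\Vert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\Vert^2 \Big]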
2. Langevin Dynamics Perspective (Noise-conditioned score networks):
This perspective helps us understand conditional image generation. Again, we only touch upon the results (more detail can be found in [13,19,20,21,24]). Stochastic Gradient Langevin Dynamics [26] can generate samples from a probability density p(x) using only the gradients ∇ₓ log p(x) in a Markov chain of updates:
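In its standard form the update is

x_t = x_{t-1} + \frac{\delta}{2}\,\nabla_x \log p(x_{t-1}) + \sqrt{\delta}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).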
Here, δ represents the step size. As T → ∞ and δ → 0, x_T converges to a sample from the true probability density p(x).
Compared to standard SGD, stochastic gradient Langevin Dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
Song and Ermon (2019) [13] proposed score-based generative modelling methods where samples are produced via Langevin dynamics using gradients of the data distribution estimated with Stein score-matching.
To scale to high-dimensional data, they perturb the data with small amounts of pre-specified noise and estimate the score of the perturbed distributions with score matching. According to the manifold hypothesis, most data is expected to lie on a low-dimensional manifold, even though it may appear high-dimensional. Thus, the data does not cover the entire space, and score estimation is unreliable in sparse regions. Adding perturbations of increasing size in steps to cover the entire space makes training more stable.
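Concretely, for a decreasing sequence of noise levels σ₁ > σ₂ > … > σ_L, the noise-conditioned score network s_θ(x, σ) of Song and Ermon [13] is trained with a denoising score-matching loss at every level (λ(σ) is a per-level weight, commonly taken as σ²):

\ell(\theta; \sigma) = \tfrac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)} \Big[ \Big\Vert s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \Big\Vert^2 \Big], \qquad L(\theta) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i)

Samples are then drawn with annealed Langevin dynamics, running the update above at each noise level from the largest σ down to the smallest.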
Architecture and Algorithm:
The original DDPM implementation used a U-Net architecture consisting of Wide ResNet blocks, group normalisation, and self-attention blocks.
The main choices for the original DDPM Paper are:
- The diffusion time step t is specified by adding a sinusoidal position embedding into each residual block. (The time embedding is similar to the sinusoidal positional embedding in the Transformer paper “Attention Is All You Need”; a minimal sketch appears after this list.)
- The encoder and decoder paths have the same number of levels, with a bottleneck block between them. Each encoder stage consists of two residual blocks with convolutional downsampling, except for the final level; each decoder stage consists of three residual blocks. There are skip connections from the encoder to the decoder at each level.
- Attention modules are applied at a single feature-map resolution.
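As a concrete illustration of the time-step embedding mentioned in the first bullet, here is a minimal PyTorch sketch; the half-sine/half-cosine split and the base frequency of 10000 follow the Transformer convention, and exact details vary between implementations:

```python
import math
import torch

def timestep_embedding(timesteps: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer diffusion timesteps t to `dim`-dimensional sinusoidal embeddings.

    Half of the dimensions use sine and the other half cosine, at geometrically
    spaced frequencies, analogous to positional encodings in Transformers.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]            # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

# e.g. embeddings for a batch of sampled timesteps, later added inside each residual block
emb = timestep_embedding(torch.tensor([1, 250, 999]), dim=128)
```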
Various other approaches and architectures are covered in [15,18].
Training and Sampling:
Training Process:
In each batch of the training process, the following steps are taken:
- Sampling a random timestep t for each training sample within the batch (e.g. images)
- Adding Gaussian noise by using the closed-form formula, according to their timesteps t
- Converting the timesteps into embeddings to feed the U-Net (or a similar family of models)
- Using the images with noise and time embeddings as input for predicting the noise present in the images
- Comparing the predicted noise to the actual noise for calculating the loss function
- Updating the Diffusion Model parameters via backpropagation using the loss function
This process repeats at each epoch, using the same images. However, different timesteps are usually sampled for each image at different epochs. This enables the model to learn to reverse the diffusion process at any timestep, enhancing its adaptability.
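Putting the steps above together, a minimal PyTorch-style sketch of one training step might look as follows; `model`, `optimizer`, and `alpha_bar` (the cumulative products of 1 − βₜ for whatever schedule is used) are placeholders rather than a reference implementation:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_bar, T=1000):
    """One DDPM-style training step on a batch of clean images x0 (sketch).

    alpha_bar: 1-D tensor of length T with the cumulative products of (1 - beta_t).
    """
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)       # random timestep per sample
    noise = torch.randn_like(x0)                               # the actual noise epsilon
    a_bar = alpha_bar[t].view(batch, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise       # closed-form forward process
    pred = model(x_t, t)                                       # model predicts the noise
    loss = F.mse_loss(pred, noise)                             # simplified DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```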
Sampling Process:
For sampling new images, the difference is that we don’t have an input image. We sample random Gaussian noise and define how many denoising steps (T) to take to generate the new images. At each step, the Diffusion Model predicts the entire noise present in the image, taking the current timestep as input. It then removes only a fraction of this predicted noise. We obtain the generated image after T inference steps.
The training and sampling algorithms are shown in Figure 9.
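For concreteness, a minimal sketch of the ancestral (DDPM-style) sampling loop is shown below; `model`, `betas`, `alphas`, and `alpha_bar` are assumed to be the same network and schedule tensors used at training time:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bar):
    """Generate images by iteratively denoising pure Gaussian noise (sketch)."""
    T = len(betas)
    x = torch.randn(shape)                                     # start from isotropic noise x_T
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                # predict the full noise at step t
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()             # remove only a fraction of the noise
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)   # re-inject scheduled noise
        else:
            x = mean                                           # final step: no noise added
    return x
```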
Conditioned Generation:
To turn a diffusion model into a conditioned model [22], we can add conditioning information y at each step with a guidance scalar s as:
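In score form (applying Bayes’ rule and scaling the conditioning term by s):

\nabla_{x_t} \log p(x_t \mid y) \;\approx\; \nabla_{x_t} \log p_\theta(x_t) + s\, \nabla_{x_t} \log p(y \mid x_t)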
The above score-based formulation eliminates the p(y) term (its gradient with respect to xₜ is zero), which would otherwise require knowledge of the entire data distribution.
The following are the popular ways to condition the diffusion model:
1. Classifier-Guided Diffusion: The score of y w.r.t. xₜ can be estimated using a classifier f_ϕ trained on noisy images [11], setting ∇_xₜ log q(y|xₜ) ≈ ∇_xₜ log f_ϕ(y|xₜ).
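Substituting this into the guided score, one common way to use it during sampling (this is the formulation given for DDIM-style sampling in [11]) is to shift the predicted noise by the scaled classifier gradient:

\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t}\; s\, \nabla_{x_t} \log f_\phi(y \mid x_t)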
The resulting ablated diffusion model (ADM) and the one with additional classifier guidance (ADM-G) can achieve better results than state-of-the-art generative models (e.g., BigGAN).
2. Classifier-Free Guidance: Conditioning is also possible without a classifier [17]. Let the unconditional denoising diffusion model p_θ(x) be parameterized through a score estimator ϵ_θ(xₜ, t) and the conditional model p_θ(x|y) through ϵ_θ(xₜ, t, y). These two models can be learned via a single neural network: a conditional diffusion model p_θ(x|y) is trained on paired data (x, y), where the conditioning information y is randomly discarded during training so that the model also learns to generate images unconditionally, i.e. ϵ_θ(xₜ, t) = ϵ_θ(xₜ, t, y = ∅). The gradient of an implicit classifier can then be represented with the conditional and unconditional score estimators; once plugged into the classifier-guided modified score, the score contains no dependency on a separate classifier.
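At sampling time, the conditional and unconditional noise predictions are combined with a guidance weight s; in the commonly used form,

\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + s\,\big(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\big),

so that s = 1 recovers the purely conditional model and s > 1 strengthens the conditioning.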
Classifier-Guided and Classifier-Free Guidance differ primarily in their training requirements and in the flexibility of control over generated outputs. Classifier guidance requires training an additional classifier, typically on noisy images, to steer the diffusion process toward specific categories, but it does not require retraining the underlying diffusion model, so pre-trained models can be used as-is. This approach limits control to the predefined classes the classifier can recognize (or, in CLIP-guided variants, to what the external scoring model can express). In contrast, Classifier-Free Guidance needs no separate classifier, but it does require training (or retraining) the diffusion model on both conditional and unconditional data. This enables much more flexible and nuanced control over the final output, allowing almost any condition to influence the generation process. Thus, while classifier-guided methods offer straightforward integration with existing models and clear category control, classifier-free guidance provides greater versatility at the expense of additional model training.
3. ControlNets: Zhang et al., 2023 [27] developed ControlNet, a separate module that can be added to an unconditional model for conditional image generation. The idea of a separate add-on module is quite popular these days, e.g., in parameter-efficient fine-tuning (PEFT) methods such as LoRA (Low-Rank Adaptation) and its variants such as QLoRA.
Improvements to Diffusion Models:
We now discuss some popular improvements to the diffusion models:
- Ho et al. (2020) [5] used a linear schedule from β₁ = 10⁻⁴ to β_T = 0.02. Nichol and Dhariwal (2021) [11] proposed a cosine-based variance schedule (any arbitrary schedule will work as long as it offers a near-linear drop in the middle of training and subtle changes around t = 0 and t = T).
- The DDPM paper [5] also introduced a positional time-step embedding, where half of the dimensions encode a sine embedding and the other half a cosine embedding.
- Nichol and Dhariwal [11] also proposed learning the reverse-process variance Σ_θ as an interpolation between βₜ and β̃ₜ; the resulting formula is given after this list.
- Song et al., 2021 [28] proposed deterministic sampling (Denoising Diffusion Implicit Models, DDIM), which has the same marginal noise distribution but deterministically maps noise back to the original data samples. Compared to DDPM, DDIM achieves higher sample quality with few sampling steps, keeps high-level features consistent under the same conditioning, and thus yields a semantically meaningful latent representation.
- Nichol and Dhariwal (2021) [11] also proposed speeding up the diffusion (sampling) process via strided sampling.
- Latent Diffusion (Rombach et al., 2022), better known as “Stable Diffusion”, runs the diffusion process in latent space instead of pixel space, giving lower training cost and faster inference. An encoder downsamples the image into latent space, and a decoder recovers the generated image.
- Cold Diffusion [14] generalises the notion of noise by applying various deterministic transformations to the image, and uses a modified sampling algorithm that makes the degradation function independent of the restoration operator up to first-order terms.
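The learned-variance interpolation mentioned in the list above takes the following form in Nichol and Dhariwal [11], where v is an extra per-dimension output of the network:

\Sigma_\theta(x_t, t) = \exp\big(v \log \beta_t + (1 - v)\log \tilde{\beta}_t\big), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t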
I write about technology, investing and books I read. Here is an index to my other blogs (sorted by topic): https://medium.com/@shashwat.gpt/index-welcome-to-my-reflections-on-code-and-capital-2ac34c7213d9
References:
- Blog: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- Blog: https://theaisummer.com/diffusion-models/
- Blog: https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da (a simplistic explanation)
- Video: https://www.youtube.com/watch?v=HoKDTa5jHvg&t=1284s
- Paper: DDPM: https://arxiv.org/pdf/2006.11239.pdf (Ho et al., 2020)
- Video Explanation: https://www.youtube.com/watch?v=W-O7AZNzbzQ
- Annotated Code: https://huggingface.co/blog/annotated-diffusion
- Blog — Variational AutoEncoders: https://lilianweng.github.io/posts/2018-08-12-vae/
- Blog — Latent Variable Models: https://theaisummer.com/latent-variable-models/
- Paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al., 2015: https://arxiv.org/pdf/1503.03585.pdf
- Paper: Improved Denoising Diffusion Probabilistic Models, Nichol and Dhariwal, 2021: https://arxiv.org/pdf/2102.09672.pdf
- Paper: Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233.pdf
- Paper: Generative Modeling by Estimating Gradients of the Data Distribution (noise-conditioned score networks), Song and Ermon, 2019: https://arxiv.org/abs/1907.05600
- Paper: Cold Diffusion: https://arxiv.org/pdf/2208.09392.pdf
- Paper: Understanding Diffusion Models: A Unified Perspective, Calvin Luo, 2022: https://arxiv.org/pdf/2208.11970.pdf
- Paper: Fast Sampling of Diffusion Models with Exponential Integrator, Zhang et al., 2022: https://arxiv.org/abs/2204.13902
- Paper: Classifier-Free Diffusion Guidance, Ho and Salimans, 2021: https://openreview.net/pdf?id=qw8AKxfYbI
- Paper: Diffusion Models: A Comprehensive study of Methods and Applications, Yang et al., 2022: https://arxiv.org/pdf/2209.00796.pdf
- Video: Diffusion and Score-based Generative Models: https://www.youtube.com/watch?v=wMmqCMwuM2Q
- Blog: https://yang-song.net/blog/2021/score/
- Blog: Autoregressive models, normalizing flows, energy-based models, VAEs, and score-based papers: https://scorebasedgenerativemodeling.github.io/
- Blog: Guiding Diffusion Process: https://sander.ai/2022/05/26/guidance.html
- Blog: Diffusion as autoencoders: https://sander.ai/2022/01/31/diffusion. html
- Video : Langevin Dynamics end to end: https://www.youtube.com/watch?v= 3-KzIjoFJy4&t=2379s
- Paper: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., 2023: https://arxiv.org/abs/2302.05543
- Paper: Bayesian Learning via Stochastic Gradient Langevin Dynamics, Welling and Teh, 2011: https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022: https://arxiv.org/abs/2112.10752
- Paper: Denoising Diffusion Implicit Models, Song et al., 2021: https://arxiv.org/
- Blog: Stable Diffusion Clearly Explained: https://medium.com/@steinsfu/stable-diffusion-clearly-explained-ed008044e07e
The original article (PDF), written in LaTeX and therefore with better formatting, is available on SlideShare: Diffusion_Models___A_concise_Perspective.pdf (www.slideshare.net).