[AI/ML] Diffusion Models — A Beginner’s Guide to the Math Behind Stable Diffusion and DALL-E!
Author(s): Shashwat Gupta
Originally published on Towards AI.
Generative modeling in computer vision has advanced significantly, with Diffusion Models leading the way and powering tools like DALL-E and Stable Diffusion. These models have transformed machine capabilities, offering new possibilities in art, design, and content creation. This blog introduces Diffusion Models, focusing on their intuition and mathematics for understanding related research. While much existing literature explains these models from a Markov Chain perspective, alternative perspectives and conditioning methods during generation remain underexplored.
This blog explores the mathematics behind two key perspectives of Diffusion Models —
1. Markov Chain Perspective
2. Langevin Dynamics Perspective (Noise-conditioned Score Generation)
while also explaining their architecture, conditioning mechanisms, and popular modifications.
Introduction:
There are several types of generative models popular now (as shown in Figure 2), but none is without its flaws:
- Generative Adversarial Networks (GANs): suffer from unstable training and limited diversity (mode collapse).
- Variational Autoencoders (VAEs) [8,9,23]: rely on a surrogate loss.
- Flow-based models: need specialized architectures to construct reversible transforms.
- Diffusion Models: inspired by non-equilibrium thermodynamics. Despite being slow at sampling, diffusion models outperform the other generative models and, in particular, are free from the issues listed above. Below, we explain the common perspectives used to understand diffusion models, specifically the ones needed to follow the architectures discussed later.
1. Markov Chain Perspective:
We touch upon the necessary mathematical details of diffusion models without diving deeply into the proofs (a more detailed treatment can be found in [1,2,3,4]). Our approach mostly follows the Denoising Diffusion Probabilistic Model (DDPM) [5,6,7], with some improvements suggested in later OpenAI papers [11,12].
Diffusion Models are latent-variable models that add noise to a sample through a Markov chain and then denoise the noisy image with a neural network. During training, noise is added according to a variance schedule, and a model is trained to denoise the image over multiple steps. During inference, denoising is applied to an isotropic Gaussian noise sample. Noising and denoising in many small steps, as opposed to a single step as in GANs, leads to more tractable computations [10].
The forward process is defined as follows:
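In standard DDPM notation [5], each forward step adds Gaussian noise whose variance βₜ is taken from the schedule:

q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})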
As t → ∞, xₜ approaches an isotropic Gaussian. For the forward process, xₜ can be computed in closed form from x₀ by using a reparametrization trick involving the sum of two Gaussians.
Defining two new variables:
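In the usual DDPM notation these are

\alpha_t = 1 - \beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s,

with which the forward process admits the closed form

q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big), \qquad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I).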
Since βₜ is small, q(xₜ₋₁|xₜ) is also Gaussian. However, estimating this quantity would require using the entire dataset, so we learn a model p_θ to approximate the conditional probabilities.
We run the reverse diffusion process:
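In DDPM form this is

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),

where p(x_T) = \mathcal{N}(0, I); in the original paper only the mean µ_θ is learned and the variance is fixed by the schedule.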
The forward and reverse processes are shown pictorially below (Figures 3 and 4):
We use the negative log-likelihood − log p_θ(x₀) as the loss. Similar to VAEs, we use the variational lower bound (VLB) to upper-bound this objective [8,9]. After simplification, additionally conditioning on x₀ (for better sampling), and ignoring pure q(xₜ) terms (since they carry no learnable parameters), we arrive at the following objective:
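Following the usual derivation, the bound decomposes into KL divergences between Gaussians (each computable in closed form) plus a reconstruction term:

L_{\mathrm{VLB}} = \underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0) \,\Vert\, p(x_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\Vert\, p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}} - \underbrace{\log p_\theta(x_0 \mid x_1)}_{L_0}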
Observing that βₜ is fixed by the schedule, we can minimize the MSE between µ̃ₜ(xₜ, x₀) and µ_θ(xₜ, t) as the objective. After simplification, this reduces to the MSE between the noise actually added at time t and the noise predicted by the model at time t; Ho et al. found that dropping the time-dependent weighting of this MSE improves sample quality.
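In the form used by Ho et al. [5], the resulting simplified training objective is a plain noise-prediction MSE:

L_{\mathrm{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[ \big\Vert \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big) \big\Vert^2 \Big]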
2. Langevin Dynamics Perspective (Noise-conditioned score networks):
This perspective helps us understand conditional image generation. Again, we only touch upon the results (more detail can be found in [13,19,20,21,24]). Stochastic Gradient Langevin Dynamics [26] can generate samples from a probability density p(x) using only the gradients ∇ₓ log p(x) in a Markov chain of updates:
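In its standard form the update is

x_t = x_{t-1} + \frac{\delta}{2}\,\nabla_x \log p(x_{t-1}) + \sqrt{\delta}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).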
Here, δ represents the step size. As T → ∞ and δ → 0, x_T converges to a sample from the true probability density p(x).
Compared to standard SGD, stochastic gradient Langevin Dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
Song and Ermon (2019) [13] proposed score-based generative modelling methods where samples are produced via Langevin dynamics using gradients of the data distribution estimated with Stein score-matching.
To scale to high-dimensional data, they perturb the data with small amounts of pre-specified noise and estimate the score of the perturbed distributions with score matching. According to the manifold hypothesis, most data is expected to lie on a low-dimensional manifold, even though it may appear high-dimensional. Thus, the data does not cover the entire space, and score estimation is unreliable in sparse regions. Adding perturbations of increasing size in steps to cover the entire space makes training more stable.
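Concretely, for a decreasing sequence of noise levels σ₁ > σ₂ > … > σ_L, the noise-conditioned score network s_θ(x, σ) of Song and Ermon [13] is trained with a denoising score-matching loss at every level (λ(σ) is a per-level weight, commonly taken as σ²):

\ell(\theta; \sigma) = \tfrac{1}{2}\, \mathbb{E}_{p_{\mathrm{data}}(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma^2 I)} \Big[ \Big\Vert s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x} - x}{\sigma^2} \Big\Vert^2 \Big], \qquad L(\theta) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i)

Samples are then drawn with annealed Langevin dynamics, running the update above at each noise level from the largest σ down to the smallest.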
Architecture and Algorithm:
The original DDPM implementation used a U-Net architecture consisting of Wide ResNet blocks, group normalisation, and self-attention blocks.
The main choices for the original DDPM Paper are:
- The diffusion time step t is specified by adding a sinusoidal position embedding into each residual block. (The time embedding is similar to the sinusoidal positional embedding in the Transformer paper “Attention Is All You Need”; a minimal sketch appears after this list.)
- The encoder and decoder paths have the same number of levels, with a bottleneck block between them. Each encoder stage consists of two residual blocks with convolutional downsampling, except for the final level; each decoder stage consists of three residual blocks. There are skip connections from the encoder to the decoder at each level.
- Attention modules are applied at a single feature-map resolution.
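As a concrete illustration of the time-step embedding mentioned in the first bullet, here is a minimal PyTorch sketch; the half-sine/half-cosine split and the base frequency of 10000 follow the Transformer convention, and exact details vary between implementations:

```python
import math
import torch

def timestep_embedding(timesteps: torch.Tensor, dim: int) -> torch.Tensor:
    """Map integer diffusion timesteps t to `dim`-dimensional sinusoidal embeddings.

    Half of the dimensions use sine and the other half cosine, at geometrically
    spaced frequencies, analogous to positional encodings in Transformers.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = timesteps.float()[:, None] * freqs[None, :]            # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

# e.g. embeddings for a batch of sampled timesteps, later added inside each residual block
emb = timestep_embedding(torch.tensor([1, 250, 999]), dim=128)
```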
Various other approaches and architectures are covered in [15,18].
Training and Sampling:
Training Process:
In each batch of the training process, the following steps are taken:
- Sampling a random timestep t for each training sample within the batch (e.g. images)
- Adding Gaussian noise by using the closed-form formula, according to their timesteps t
- Converting the timesteps into embeddings to feed the U-Net (or a similar family of models)
- Using the images with noise and time embeddings as input for predicting the noise present in the images
- Comparing the predicted noise to the actual noise for calculating the loss function
- Updating the Diffusion Model parameters via backpropagation using the loss function
This process repeats at each epoch, using the same images. However, different timesteps are usually sampled for each image at different epochs. This enables the model to learn to reverse the diffusion process at any timestep, enhancing its adaptability.
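Putting the steps above together, a minimal PyTorch-style sketch of one training step might look as follows; `model`, `optimizer`, and `alpha_bar` (the cumulative products of 1 − βₜ for whatever schedule is used) are placeholders rather than a reference implementation:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, x0, alpha_bar, T=1000):
    """One DDPM-style training step on a batch of clean images x0 (sketch).

    alpha_bar: 1-D tensor of length T with the cumulative products of (1 - beta_t).
    """
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)       # random timestep per sample
    noise = torch.randn_like(x0)                               # the actual noise epsilon
    a_bar = alpha_bar[t].view(batch, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise       # closed-form forward process
    pred = model(x_t, t)                                       # model predicts the noise
    loss = F.mse_loss(pred, noise)                             # simplified DDPM objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```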
Sampling Process:
For sampling new images, the difference is that we don’t have an input image. We sample random Gaussian noise and define how many denoising steps (T) to take to generate the new images. At each step, the Diffusion Model predicts the entire noise present in the image, taking the current timestep as input. It then removes only a fraction of this predicted noise. We obtain the generated image after T inference steps.
The training and sampling algorithms are shown in Figure 9.
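For concreteness, a minimal sketch of the ancestral (DDPM-style) sampling loop is shown below; `model`, `betas`, `alphas`, and `alpha_bar` are assumed to be the same network and schedule tensors used at training time:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alpha_bar):
    """Generate images by iteratively denoising pure Gaussian noise (sketch)."""
    T = len(betas)
    x = torch.randn(shape)                                     # start from isotropic noise x_T
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                # predict the full noise at step t
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()             # remove only a fraction of the noise
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)   # re-inject scheduled noise
        else:
            x = mean                                           # final step: no noise added
    return x
```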
Conditioned Generation:
To turn a diffusion model into a conditioned model [22], we can add conditioning information y at each step with a guidance scalar s as:
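In score form (applying Bayes’ rule and scaling the conditioning term by s):

\nabla_{x_t} \log p(x_t \mid y) \;\approx\; \nabla_{x_t} \log p_\theta(x_t) + s\, \nabla_{x_t} \log p(y \mid x_t)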
The above score-based formulation eliminates the p(y) term (its gradient with respect to xₜ is zero), which would otherwise require knowledge of the entire data distribution.
The following are the popular ways to condition the diffusion model:
1. Classifier-Guided Diffusion: The score of y w.r.t. xₜ can be estimated using a classifier f_ϕ trained on noisy images [11], setting ∇_xₜ log q(y|xₜ) ≈ ∇_xₜ log f_ϕ(y|xₜ).
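Substituting this into the guided score, one common way to use it during sampling (this is the formulation given for DDIM-style sampling in [11]) is to shift the predicted noise by the scaled classifier gradient:

\hat{\epsilon}_\theta(x_t, t) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t}\; s\, \nabla_{x_t} \log f_\phi(y \mid x_t)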
The resulting ablated diffusion model (ADM) and the one with additional classifier guidance (ADM-G) can achieve better results than state-of-the-art generative models (e.g., BigGAN).
2. Classifier-Free Guidance: Conditioning is also possible without a classifier [17]. Let the unconditional denoising diffusion model p_θ(x) be parameterized through a score estimator ϵ_θ(xₜ, t) and the conditional model p_θ(x|y) through ϵ_θ(xₜ, t, y). These two models can be learned via a single neural network: a conditional diffusion model p_θ(x|y) is trained on paired data (x, y), where the conditioning information y is randomly discarded during training so that the model also learns to generate images unconditionally, i.e. ϵ_θ(xₜ, t) = ϵ_θ(xₜ, t, y = ∅). The gradient of an implicit classifier can then be represented with the conditional and unconditional score estimators; once plugged into the classifier-guided modified score, the score contains no dependency on a separate classifier.
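At sampling time, the conditional and unconditional noise predictions are combined with a guidance weight s; in the commonly used form,

\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + s\,\big(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\big),

so that s = 1 recovers the purely conditional model and s > 1 strengthens the conditioning.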
Classifier-Guided and Classifier-Free Guidance differ primarily in their training requirements and in the flexibility of control over generated outputs. Classifier guidance requires training an additional classifier, typically on noisy images, to steer the diffusion process toward specific categories, but it does not require retraining the underlying diffusion model, so pre-trained models can be used as-is. This approach limits control to the predefined classes the classifier can recognize (or, in CLIP-guided variants, to what the external scoring model can express). In contrast, Classifier-Free Guidance needs no separate classifier, but it does require training (or retraining) the diffusion model on both conditional and unconditional data. This enables much more flexible and nuanced control over the final output, allowing almost any condition to influence the generation process. Thus, while classifier-guided methods offer straightforward integration with existing models and clear category control, classifier-free guidance provides greater versatility at the expense of additional model training.
3. ControlNets: Zhang et al., 2023 [27] developed ControlNet, a separate module that can be added to an unconditional model for conditional image generation. The idea of a separate add-on module is quite popular these days, e.g., in parameter-efficient fine-tuning (PEFT) methods such as LoRA (Low-Rank Adaptation) and its variants such as QLoRA.
Improvements to Diffusion Models:
We now discuss some popular improvements to the diffusion models:
- Ho et al. (2020) [5] used a linear schedule from β₁ = 10⁻⁴ to β_T = 0.02. Nichol and Dhariwal (2021) [11] proposed a cosine-based variance schedule (any arbitrary schedule will work as long as it offers a near-linear drop in the middle of training and subtle changes around t = 0 and t = T).
- The DDPM paper [5] also introduced a positional time-step embedding, where half of the dimensions encode a sine embedding and the other half a cosine embedding.
- Nichol and Dhariwal [11] also proposed learning the reverse-process variance Σ_θ as an interpolation between βₜ and β̃ₜ; the resulting formula is given after this list.
- Song et al., 2021 [28] proposed deterministic sampling (Denoising Diffusion Implicit Models, DDIM), which has the same marginal noise distribution but deterministically maps noise back to the original data samples. Compared to DDPM, DDIM achieves higher sample quality with few sampling steps, keeps high-level features consistent under the same conditioning, and thus yields a semantically meaningful latent representation.
- Nichol and Dhariwal (2021) [11] also proposed speeding up the diffusion (sampling) process via strided sampling.
- Latent Diffusion (Rombach et al., 2022), better known as “Stable Diffusion”, runs the diffusion process in latent space instead of pixel space, giving lower training cost and faster inference. An encoder downsamples the image into latent space, and a decoder recovers the generated image.
- Cold Diffusion [14] generalises the notion of noise by applying various deterministic transformations to the image, and uses a modified sampling algorithm that makes the degradation function independent of the restoration operator up to first-order terms.
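The learned-variance interpolation mentioned in the list above takes the following form in Nichol and Dhariwal [11], where v is an extra per-dimension output of the network:

\Sigma_\theta(x_t, t) = \exp\big(v \log \beta_t + (1 - v)\log \tilde{\beta}_t\big), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t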
I write about technology, investing and books I read. Here is an index to my other blogs (sorted by topic): https://medium.com/@shashwat.gpt/index-welcome-to-my-reflections-on-code-and-capital-2ac34c7213d9
References:
- Blog: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
- Blog: https://theaisummer.com/diffusion-models/
- Blog: https://towardsdatascience.com/diffusion-models-made-easy-8414298ce4da (a simplistic explanation)
- Video: https://www.youtube.com/watch?v=HoKDTa5jHvg&t=1284s
- Paper: DDPM: https://arxiv.org/pdf/2006.11239.pdf (Ho et al., 2020)
- Video Explanation: https://www.youtube.com/watch?v=W-O7AZNzbzQ
- Annotated Code: https://huggingface.co/blog/annotated-diffusion
- Blog — Variational AutoEncoders: https://lilianweng.github.io/posts/2018-08-12-vae/
- Blog — Latent Variable Models: https://theaisummer.com/latent-variable-models/
- Paper: Deep Unsupervised Learning using Nonequilibrium Thermodynamics, Sohl-Dickstein et al., 2015: https://arxiv.org/pdf/1503.03585.pdf
- Paper: Improved Denoising Diffusion Probabilistic Models, Nichol and Dhariwal, 2021: https://arxiv.org/pdf/2102.09672.pdf
- Paper: Diffusion Models Beat GANs on Image Synthesis: https://arxiv.org/pdf/2105.05233.pdf
- Paper: Generative Modeling by Estimating Gradients of the Data Distribution (noise-conditioned score networks), Song and Ermon, 2019: https://arxiv.org/abs/1907.05600
- Paper: Cold Diffusion: https://arxiv.org/pdf/2208.09392.pdf
- Paper: Understanding Diffusion Models: A Unified Perspective, Calvin Luo, 2022: https://arxiv.org/pdf/2208.11970.pdf
- Paper: Fast Sampling of Diffusion Models with Exponential Integrator, Zhang et al., 2022: https://arxiv.org/abs/2204.13902
- Paper: Classifier-Free Diffusion Guidance, Ho and Salimans, 2021: https://openreview.net/pdf?id=qw8AKxfYbI
- Paper: Diffusion Models: A Comprehensive study of Methods and Applications, Yang et al., 2022: https://arxiv.org/pdf/2209.00796.pdf
- Video: Diffusion and Score-based Generative Models: https://www.youtube.com/watch?v=wMmqCMwuM2Q
- Blog: https://yang-song.net/blog/2021/score/
- Blog: Autoregressive models, normalizing flows, energy-based models, VAEs, and score-based papers: https://scorebasedgenerativemodeling.github.io/
- Blog: Guiding Diffusion Process: https://sander.ai/2022/05/26/guidance.html
- Blog: Diffusion as autoencoders: https://sander.ai/2022/01/31/diffusion. html
- Video : Langevin Dynamics end to end: https://www.youtube.com/watch?v= 3-KzIjoFJy4&t=2379s
- Paper: Adding Conditional Control to Text-to-Image Diffusion Models, Zhang et al., 2023: https://arxiv.org/abs/2302.05543
- Paper: Bayesian Learning via Stochastic Gradient Langevin Dynamics, Welling and Teh, 2011: https://www.stats.ox.ac.uk/~teh/research/compstats/WelTeh2011a.pdf
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models, Rombach et al., 2022: https://arxiv.org/abs/2112.10752
- Paper: Denoising Diffusion Implicit Models, Song et al., 2021: https://arxiv.org/
- Blog: Stable Diffusion Clearly Explained: https://medium.com/@steinsfu/stable-diffusion-clearly-explained-ed008044e07e
The original article (PDF), written in LaTeX and therefore with better formatting, is available on SlideShare: Diffusion_Models___A_concise_Perspective.pdf (www.slideshare.net).