The Magic of Variational Autoencoders (VAE), Where Creativity Begins! 🎨
Author(s): JAIGANESAN
Originally published on Towards AI.
I'm assuming you're already familiar with the basics of autoencoders, convolution, transpose convolution, and latent dimensions, and how they work. If not, I highly recommend checking out my previous article, "Autoencoder is Simple 😃", to get up to speed. Otherwise, this article won't make much sense to you 😟.
Autoencoder is Simple 😲! Discover the elegance in image compression and reconstruction with autoencoders, where simplicity meets… (medium.com)
In this article, we're going to dive into the world of Variational Autoencoders (VAEs) and explore what sets them apart from traditional autoencoders. I'll focus solely on the reparameterization trick and the loss function that make a VAE different from an autoencoder. So, let's dive in!
In this article, I will intentionally use certain sentences and words repeatedly to ensure that my message is clear.
So, what's the magic behind VAEs?
In an autoencoder, the latent vector is fed directly into the decoder without any changes. A VAE introduces a twist: the reparameterization trick adds an element of randomness to the latent space. This subtle change forces the decoder to be more robust when reconstructing the original image, allowing it to generate new images that are similar to, or slightly different from, the input images.
Unlike autoencoders, whose layers are deterministic, Variational Autoencoders (VAEs) introduce a bit of randomness into the mix. The encoder output is fed into two linear layers that produce a mean vector and a log-variance vector. To generate the actual latent vector, the model samples from this distribution using an additional random component called epsilon. This touch of randomness is what sets VAEs apart and makes them capable of generating new images.
The latent vector Z is encouraged to follow a standard normal distribution, Z ~ N(0, 1), meaning it has a mean of zero and a standard deviation of one (the regularization term pushes the latent space toward a normal distribution). But here's the clever part: the mean and standard deviation are parameterized, meaning they are produced by linear layers with learnable weights (W1, W2). During training, these weights are adjusted to optimize the model.
self.fc_mu = nn.Linear(2048*4*4, latent_dim) # Mean vector from Encoder output
self.fc_logvar = nn.Linear(2048*4*4, latent_dim) # Log-variance vector from Encoder output
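As a rough sketch of how these two layers are used (the encode helper below is my own illustration, not the author's exact code, and latent_dim is whatever dimension you pick, e.g., 128), the 4 X 4 X 2048 encoder feature map is flattened before being projected to the mean and log-variance vectors:

def encode(self, x):
    h = self.encoder(x)           # (batch, 2048, 4, 4) feature map from the encoder shown below
    h = h.view(h.size(0), -1)     # flatten to (batch, 2048*4*4)
    mu = self.fc_mu(h)            # (batch, latent_dim) mean vector
    logvar = self.fc_logvar(h)    # (batch, latent_dim) log-variance vector
    return mu, logvar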
By introducing this probabilistic element, VAEs can generate new, diverse samples that are similar to the input data, making them incredibly powerful for tasks like image generation and data augmentation.
VAE Optimization: Unraveling the Encoder and Decoder
In a Variational Autoencoder (VAE), the encoder and decoder play crucial roles in learning and generating images.
The Encoder's Job: Learning the Latent Space
The encoder computes q_phi(z|x), which represents the probabilistic distribution of the latent variable z given the input image x. In other words, it learns to map the input images to a probabilistic distribution in the latent space.
# Encoder Input : 64 X 64 X 3 ( Input Image )
self.encoder = nn.Sequential(
nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1), # output dimension: 64 X 64 X 32
nn.ReLU(),
nn.Conv2d(32, 128, kernel_size=3, stride=1, padding=1), # 64 X 64 X 128
nn.ReLU(),
nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 32 X 32 X 256
nn.ReLU(),
nn.Conv2d(256, 512, kernel_size=4, stride=2, padding=1), # 16 X 16 X 512
nn.ReLU(),
nn.Conv2d(512, 1024, kernel_size=4, stride=2, padding=1), # 8 X 8 X 1024
nn.ReLU(),
nn.Conv2d(1024, 2048, kernel_size=4, stride=2, padding=1), # 4 X 4 X 2048
nn.ReLU(),
)
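If you want to sanity-check the shape comments above, a quick illustrative forward pass with a dummy batch should end in the 4 X 4 X 2048 feature map (model here is my placeholder for an instance of the VAE class that holds this encoder):

import torch

x = torch.randn(1, 3, 64, 64)   # one dummy 64 X 64 RGB image
h = model.encoder(x)            # pass it through the encoder
print(h.shape)                  # expected: torch.Size([1, 2048, 4, 4])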
The Decoder's Job: Reconstructing the Input Space
The decoder computes p_theta(x|z), which represents the probabilistic distribution of the input space x (reconstructed images) given the latent variable z. This means it learns to generate images from the latent space.
# Decoder Input : 4 X 4 X 2048
self.decoder = nn.Sequential(
nn.ConvTranspose2d(2048, 1024, kernel_size=4, stride=2, padding=1), # Output dimension : 8 X 8 X 1024
nn.ReLU(),
nn.ConvTranspose2d(1024, 512, kernel_size=4, stride=2, padding=1), # 16 X 16 X 512
nn.ReLU(),
nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1), # 32 X 32 X 256
nn.ReLU(),
nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 64 X 64 X 128
nn.ReLU(),
nn.ConvTranspose2d(128, 32, kernel_size=3, stride=1, padding=1), # 64 X 64 X 32
nn.ReLU(),
nn.ConvTranspose2d(32, 3, kernel_size=3, stride=1, padding=1), # 64 X 64 X 3 (Reconstructed image)
nn.Sigmoid()
)
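Similarly, feeding a dummy 4 X 4 X 2048 tensor through the decoder should give back a 64 X 64 RGB image (again, model is a placeholder instance of the VAE class):

import torch

h = torch.randn(1, 2048, 4, 4)   # dummy decoder input
x_recon = model.decoder(h)       # pass it through the decoder
print(x_recon.shape)             # expected: torch.Size([1, 3, 64, 64])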
The Learnable Parameters: The phi and theta symbols denote the learnable parameters in the encoder and decoder networks, including weights, kernels, and biases.
The Standard VAE Loss:
When training a Variational Autoencoder (VAE), the loss function we need to optimize has two crucial components: the reconstruction loss and the regularization term. For the reconstruction loss, I have used Mean Squared Error (MSE), while for the regularization term, I used KL divergence.
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, logvar):
    """
    Reconstruction term: MSE between the reconstructed and the original images.
    Regularization term: KL divergence, which measures how far the learned
    latent distribution is from a standard normal distribution.
    """
    recon_loss = F.mse_loss(x_recon, x, reduction='sum')
    kl_divergence = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_divergence
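To see where this loss sits in training, here is a hedged sketch of a single training step. The names VAE and dataloader, and the forward signature (returning the reconstruction along with mu and logvar), are my assumptions, not the author's exact code:

model = VAE(latent_dim=128)                  # hypothetical instantiation
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for x, _ in dataloader:                      # batches of 64 X 64 RGB images
    x_recon, mu, logvar = model(x)           # encode -> reparameterize -> decode
    loss = vae_loss(x_recon, x, mu, logvar)  # reconstruction + regularization
    optimizer.zero_grad()
    loss.backward()                          # works thanks to the reparameterization trick (below)
    optimizer.step()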
L(phi, theta, x) = Reconstruction Loss + Regularization Term
The reconstruction loss is all about measuring how well the VAE can reconstruct the input images, which gives us an idea of how similar the reconstructed images are to the originals.
D ( q_phi (z | x) || p(z) )
The regularization term (D), on the other hand, is where things get really interesting. This term is calculated using the Kullback-Leibler (KL) divergence, which measures the difference between the learned approximate posterior distribution q_phi(z|x) and the prior distribution p(z).
Why do we use the KL divergence? Because KL divergence quantifies the difference between two probability distributions.
The Common Choice of Prior is a Standard Normal Gaussian:
p(z) = N(mean = 0, std = 1)
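For a Gaussian posterior N(mean, std^2) and this standard normal prior, the KL divergence has a simple closed form, and it is exactly what the kl_divergence line in the loss function above computes (summing over the latent dimensions):

KL( N(mean, std^2) || N(0, 1) ) = -0.5 * sum( 1 + log(std^2) - mean^2 - std^2 )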
Think of it like a penalty term that encourages the VAE to keep its posterior distribution (the distribution of the latent vectors) close to the prior distribution, which is usually a Gaussian with zero mean and unit variance. In other words, it enforces the latent vectors to stay roughly standard normal and not diverge too far. This is also why the two vectors get their names, mean and variance.
By doing so, the regularization term helps the VAE avoid overfitting and promotes a more robust representation of the data (images). It's like a gentle nudge that keeps the VAE on track, ensuring it doesn't get too caught up in fitting the training data perfectly but instead learns to generalize well to new, unseen data.
So, what's the intuition behind using regularization and a normal prior in VAEs?
Well, we're trying to achieve two key properties: continuity and completeness.
Continuity: We want points that are close together in the latent space to correspond to similar content once decoded. This means that if we move slightly in the latent space, the decoded output should change smoothly and gradually. Think of it like a continuous spectrum of images, where similar images are clustered together.
Completeness: We also want to ensure that sampling from the latent space produces meaningful and coherent content. When we decode a sample from the latent space, we want every pixel in the reconstructed image to be in the right place, so the overall picture makes sense. This means the VAE should be able to generate a diverse range of images that are all plausible and realistic.
By using regularization with a normal prior, we're able to enforce an information gradient in the latent space. This means the VAE learns to represent the input data in a way that is both continuous and complete. The normal prior helps to "push" the learned representation toward a more structured and meaningful organization, making it easier to generate new samples that are similar to the training images.
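As a hedged illustration of these two properties (model, latent_dim, and the decode helper are my placeholders; a sketch of such a decode step appears near the end of this article), completeness means a random draw from N(0, 1) should decode to a plausible image, and continuity means interpolating between two latent vectors should give a smooth transition between images:

import torch

# Completeness: any sample from the standard normal prior decodes to a plausible image.
z = torch.randn(1, latent_dim)
new_image = model.decode(z)                  # (1, 3, 64, 64)

# Continuity: points between two latent vectors decode to a smooth blend of images.
z1, z2 = torch.randn(1, latent_dim), torch.randn(1, latent_dim)
frames = [model.decode((1 - a) * z1 + a * z2) for a in torch.linspace(0, 1, steps=5)]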
Is everything OK now? No, we still have a problem: the challenge of backpropagation with stochasticity (randomness).
So, how do we handle backpropagation when there's stochasticity involved? Well, it turns out that's a major problem: we can't backpropagate gradients through sampling layers, which are essentially layers that introduce randomness.
Backpropagation requires a fully deterministic pipeline, where every network and layer behaves deterministically. This is essential for the gradient descent algorithm to work and update the model's parameters.
However, when we introduce stochasticity through sampling layers, we break this deterministic chain. The randomness in these layers means we cannot compute gradients through them, so we can't apply the backpropagation algorithm as usual.
This is where the reparameterization trick comes in, which we'll discuss next. It's a clever workaround that lets gradients flow through the sampling step so we can still train the VAE using backpropagation.
Reparameterizing the Sampling Layer:
So, how do we deal with the stochasticity in the sampling layer? Well, we can reparameterize it in a way that allows us to backpropagate through the network.
We can represent the sampling layer as Z ~ N(mean, std), where the mean is a fixed, deterministic vector and the standard deviation vector is scaled by a random constant drawn from a prior distribution (in this case, a standard normal distribution).
By doing so, we can rewrite Z as Z = mean + std * epsilon, where epsilon ~ N(0, 1). This is the reparameterization trick!
def reparameterize(self, mu, logvar):
    """
    Build the latent vector from the mean, the standard deviation, and epsilon.
    Epsilon ~ N(0, 1) carries all the randomness, while mu and std stay
    deterministic; this helps keep the latent space close to a normal
    distribution and lets gradients flow through the sampling step.
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std
By separating the randomness (represented by epsilon) from the deterministic nodes and layers, we can now backpropagate through the network. This allows us to update the weights and kernels using an optimizer.
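A quick, purely illustrative way to convince yourself of this is to run the reparameterization step on toy tensors and check that gradients reach mu and logvar even though epsilon is random:

import torch

mu = torch.zeros(4, requires_grad=True)      # toy mean vector
logvar = torch.zeros(4, requires_grad=True)  # toy log-variance vector

std = torch.exp(0.5 * logvar)
eps = torch.randn_like(std)                  # all the randomness lives here
z = mu + eps * std                           # reparameterized latent vector

z.sum().backward()                           # backpropagate through the sampling step
print(mu.grad, logvar.grad)                  # both gradients are populated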
Variational Autoencoders (VAEs) have a wide range of applications, from image generation and data augmentation to anomaly detection, feature learning from images, image-to-image translation, material discovery, and chemical synthesis planning. The list goes on! VAEs have become a pivotal tool in many areas of research and industry.
One last thing I want to mention is the transformation from the latent vector to the VAE decoder input: with the help of a linear layer, the latent vector is projected to the decoder input size and then reshaped into the decoder input dimensions (4 X 4 X 2048).
self.decoder_input = nn.Linear(latent_dim, 2048*4*4) # Latent to decoder input size
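Putting the pieces together, a minimal sketch of the decode step could look like this (the decode method name is mine; the reshape follows the 4 X 4 X 2048 decoder input described above):

def decode(self, z):
    h = self.decoder_input(z)       # project latent vector to (batch, 2048*4*4)
    h = h.view(-1, 2048, 4, 4)      # reshape into the decoder input dimensions
    return self.decoder(h)          # (batch, 3, 64, 64) reconstructed image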
Note: The weights in the linear layers W1, W2, and W3 are learnable parameters, which means they are updated during training. For simplicity, I did not use any bias in my example.
If you have time, please check out my VAE implementation on Kaggle.
If you don't understand linear projection, I highly recommend you read my previous article about Neural networks 👽 😃
The Randomness gives Creativity in VAE 💡 🚀
I believe I have made some sense of the Variational Autoencoder (VAE). If you found my article useful 👍, give it a 👏! Feel free to follow for more insights. If something doesn't click, take some time and read it again; it will start to make sense.
Let's also stay in touch on 🔗 LinkedIn 🌏 ❤️ to keep the conversation going!