Evaluating Mode Collapse in GANs Using NDB Score
Last Updated on July 25, 2023 by Editorial Team
Author(s): Shashank Kumar
Originally published on Towards AI.
Below are a few art pieces I generated from a GAN. They arenβt striking at all, but theyβre diverse. However, this is not always the case.
The next set of images is from another GAN I trained. Not only are they awful, but theyβre also identical.
GANs are notoriously hard to train. They seldom, if ever, converge and often suffer from mode collapse. As illustrated in the above images, mode collapse happens when GANs fail to pick up the different modes present in data distribution and generate similar pictures relentlessly.
Itβs convenient to spot mode collapse by merely plotting images, but as dataset size increases, it might be handy to evaluate it quantitatively. Weβll do that using the NDB score.
This post assumes familiarity with the GAN training mechanism. Refer to this post if you donβt know how they function.
Mode Collapse
You see, mode collapse is ingrained in the GAN training strategy. Real-world data is multi-modal, and an ideal GAN must capture them all. For instance, each digit in the MNIST dataset is a separate mode, and youβd prefer a GAN that generates all the numbers. However, we generally never incentivize them to do so.
Suppose the generator constructs the digit β2β well enough to fool the discriminator. It doesnβt need to hustle anymore. The discriminator, though, during its training iteration, will receive these generated twos labeled as fake and, over time, learn to catch the bluff. When this happens, the generator could easily switch to another digit, say β3β, and continue the mode collapse loop. Intuitively, you could consider this as apathy to work extra when less is sufficient.
Now, letβs learn to track this phenomenon qualitatively.
Setting up the GAN
The full notebook for this implementation can be found at these links:
GAN Art and NDB score
Explore and run machine learning code with Kaggle Notebooks U+007C Using data from multiple data sources
www.kaggle.com
https://github.com/shashank14k/Generative_Models/blob/main/GAN/notebooks/gan-art-and-ndb-score.ipynb
The data used for training can be found here (license). These are a few images from it.
Weβll start by making these imports.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape, Conv2D, BatchNormalization, Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU, Dropout, ZeroPadding2D, Flatten, Activation
from tensorflow.keras.optimizers import Adam
from sklearn.cluster import KMeans
Next, weβll load the images from the directory using the TensorFlow data loader, reduce their shapes to (64,64), and normalize them. Note the batch size here is half of the global batch because the other half would come from generator images.
BATCH = 64
IMG_SIZE = (64,64)
LATENT_DIM = 100
EPOCHS = 600
PATH = "../input/abstract-art-gallery/Abstract_gallery/Abstract_gallery"
#Importing data
batch_s = int(BATCH/2)
#Import as tf.Dataset
data = tf.keras.preprocessing.image_dataset_from_directory(PATH, label_mode = None, image_size = IMG_SIZE, batch_size = batch_s).map(lambda x: x /255.0)
Let us now build the generator and the discriminator. Note that the discriminator does include any pooling layers. According to this 2015 paper, stridden convolutions perform better than pooling layers.
generator=Sequential()
generator.add(Dense(4*4*512,input_shape=[LATENT_DIM]))
generator.add(Reshape([4,4,512]))
generator.add(Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(64, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(3, kernel_size=4, strides=2, padding="same",
activation='sigmoid'))discriminator=Sequential()
discriminator.add(Conv2D(32, kernel_size=4, strides=2, padding="same",input_shape=[64,64, 3]))
discriminator.add(Conv2D(64, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(BatchNormalization())
discriminator.add(Conv2D(128, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(BatchNormalization())
discriminator.add(Conv2D(256, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(Flatten())
discriminator.add(Dropout(0.5))
discriminator.add(Dense(1,activation='sigmoid'))
Define the training process
class GAN(tf.keras.Model):
def __init__(self, discriminator, generator, latent_dim):
super(GAN, self).__init__()
self.discriminator = discriminator
self.generator = generator
self.latent_dim = latent_dim
def compile(self, d_optimizer, g_optimizer, loss_fn):
super(GAN, self).compile()
self.d_optimizer = d_optimizer
self.g_optimizer = g_optimizer
self.loss_fn = loss_fn
self.dloss = tf.keras.metrics.Mean(name="discriminator_loss")
self.gloss = tf.keras.metrics.Mean(name="generator_loss")
@property
def metrics(self):
return [self.dloss, self.gloss]
def train_step(self, real_images):
batch_size = tf.shape(real_images)[0]
noise = tf.random.normal(shape=(batch_size, self.latent_dim))
generated_images = self.generator(noise)
combined_images = tf.concat([generated_images, real_images], axis=0)
labels = tf.concat([tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0)
labels += 0.05 * tf.random.uniform(tf.shape(labels))
with tf.GradientTape() as tape:
predictions = self.discriminator(combined_images)
dloss = self.loss_fn(labels, predictions)grads = tape.gradient(dloss, self.discriminator.trainable_weights)
self.d_optimizer.apply_gradients(zip(grads, self.discriminator.trainable_weights))
noise = tf.random.normal(shape=(2*batch_size, self.latent_dim))
labels = tf.zeros((2*batch_size, 1))
with tf.GradientTape() as tape:
predictions = self.discriminator(self.generator(noise))
gloss = self.loss_fn(labels, predictions)
grads = tape.gradient(gloss, self.generator.trainable_weights)
self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))
self.dloss.update_state(dloss)
self.gloss.update_state(gloss)
return {"d_loss": self.dloss.result(), "g_loss": self.gloss.result()}
Now, letβs move to the evaluation. Weβll use k-means clustering, which might not make sense qualitatively since all our images are random paintings(single class). Nevertheless, I expect the k-means algorithm to identify subtle similarities among them and create appropriate clusters.
We have RGB images of shape (64,64). To reduce dimensions, weβll average the arrays along the last axis to convert them to grayscale. Note that the actual formula for converting to grayscale is different. Refer to this link for more information. We can further shrink the dimensions using autoencoders/PCA, but Iβll refrain for now. Finally, for clustering, weβll also flatten the images.
images = np.asarray(images)
images = np.mean(images,axis=3)
images = images.reshape((images.shape[0],-1))
To limit computation effort, Iβve only used the first 500 images to create clusters. The elbow score is not quite plateauing yet. It could be because of the subtle differences between images of the same class. Further reduction in image dimensionality might help create better clusters. For illustration, weβll work with the kink at cluster 7.
elbow_scores=[]
for c in range(4,10):
kmeans = KMeans(c)
kmeans.fit(images[:500])
elbow_scores.append(kmeans.inertia_)
plt.plot(range(4,10),elbow_scores)
plt.xlabel('Number of Clusters')
plt.title('Elbow Score')
plt.show()
Now, weβll generate 500 images from the generator and see which clusters they fall into.
kmeans=KMeans(7)
train_classes=kmeans.fit_predict(images[:500])
arr = tf.random.normal(shape=(500,LATENT_DIM))
generated_portraits = generator(arr)
generated_portraits = np.array(generated_portraits).mean(axis=3).reshape((generated_portraits.shape[0],-1))
generated_classes = kmeans.predict(generated_portraits)
We have generated images from all but cluster 4. The GAN seems to have learned the distribution well and could improve with more training iterations/hyper-parameter tuning. Next, we expand this evaluation to create a more concrete statistical test (NDB score).
NDB Score
The ideal GAN must closely mimic the real data distribution. This is quantified using the NDB score. Hereβs how it's computed:
- Cluster the training data (t samples) into βnβ bins (Like we have clustered the paintings into 7 bins)
- Generate (g samples) of images
- Predict the cluster(bin) of each generated image
- For each bin, do the following test:
a. Compute the proportions of training and generated samples in the bin
b.Divide their difference by the standard error SE, which is calculated as shown below.
Here βpβ and βqβ are used to refer to training and generated data, and βPβ is the pooled sample proportion.
c. If the p-value corresponding to the z-score is less than a threshold, the bin is deemed statistically different
5. Divide the number of statistically different bins by the total number of bins. This yields a number b/w 0 and 1, quantifying the difference between the real and learned distributions.
6. If the above quantity is greater than a set threshold, the GAN is deemed to have encountered mode collapse.
def ndb_score(training_data_classes,generated_data_classes,num_classes,z_threshold):
ndb = []
NT = len(training_data_classes)
NG = len(generated_data_classes)
for i in range(num_classes):
nt = np.sum(training_data_classes==i)
pt = nt/len(training_data_classes) #training data proportion for bin
ng = np.sum(generated_data_classes==i)
pg = ng/len(generated_data_classes) #generated data proportion for bin
P = (nt+ng)/(NT+NG)
SE = (P*(1-P)*((1/NT)+(1/NG)))**0.5
if abs((pt-pg)/SE) > z_threshold:
ndb.append(i)
print(f"Statisticall different classes:{ndb}")
print(f"ndb score: {len(ndb)/num_classes}")
Our GAN has an NDB score of 0.25, and only two clusters-4,5- appear statistically different. So, weβve successfully avoided the devious trench of mode collapse. This function can be made part of the GAN class and run at the end of each epoch as a validation scheme. Youβll find that code here.
Conclusion
Thanks for reading till the end. There are a lot of ideas around how to avoid mode collapse. Besides tuning hyper-parameters and trying different loss functions, one could adopt a different training strategy. This paper details a few mechanisms. Iβll try to cover them in some other post.
References:
- https://wandb.ai/authors/DCGAN-ndb-test/reports/Measuring-Mode-Collapse-in-GANs–VmlldzoxNzg5MDk#:~:text=The%20NDB%20score%20is%20one,in%20the%20score%20over%20time.
- https://arxiv.org/abs/1805.12462
- Dataset : https://www.kaggle.com/datasets/bryanb/abstract-art-gallery
Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming aΒ sponsor.
Published via Towards AI