Evaluating Mode Collapse in GANs Using NDB Score

Last Updated on July 25, 2023 by Editorial Team

Author(s): Shashank Kumar

Originally published on Towards AI.

Below are a few art pieces I generated from a GAN. They aren’t striking at all, but they’re diverse. However, this is not always the case.

Evaluating Mode Collapse in GANs Using NDB Score — GAN generated images

The next set of images is from another GAN I trained. Not only are they awful, but they’re also identical.

GANs are notoriously hard to train. They seldom, if ever, converge and often suffer from mode collapse. As illustrated in the above images, mode collapse happens when GANs fail to pick up the different modes present in data distribution and generate similar pictures relentlessly.

It’s convenient to spot mode collapse by merely plotting images, but as dataset size increases, it might be handy to evaluate it quantitatively. We’ll do that using the NDB score.

This post assumes familiarity with the GAN training mechanism. Refer to this post if you don’t know how they function.

Mode Collapse

You see, mode collapse is ingrained in the GAN training strategy. Real-world data is multi-modal, and an ideal GAN must capture them all. For instance, each digit in the MNIST dataset is a separate mode, and you’d prefer a GAN that generates all the numbers. However, we generally never incentivize them to do so.

Suppose the generator constructs the digit ‘2’ well enough to fool the discriminator. It doesn’t need to hustle anymore. The discriminator, though, during its training iteration, will receive these generated twos labeled as fake and, over time, learn to catch the bluff. When this happens, the generator could easily switch to another digit, say ‘3’, and continue the mode collapse loop. Intuitively, you could consider this as apathy to work extra when less is sufficient.

Now, let’s learn to track this phenomenon qualitatively.

Setting up the GAN

The full notebook for this implementation can be found at these links:

GAN Art and NDB score

Explore and run machine learning code with Kaggle Notebooks U+007C Using data from multiple data sources

www.kaggle.com

https://github.com/shashank14k/Generative_Models/blob/main/GAN/notebooks/gan-art-and-ndb-score.ipynb

The data used for training can be found here (license). These are a few images from it.

We’ll start by making these imports.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Reshape, Conv2D, BatchNormalization, Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU, Dropout, ZeroPadding2D, Flatten, Activation
from tensorflow.keras.optimizers import Adam
from sklearn.cluster import KMeans

Next, we’ll load the images from the directory using the TensorFlow data loader, reduce their shapes to (64,64), and normalize them. Note the batch size here is half of the global batch because the other half would come from generator images.

BATCH = 64
IMG_SIZE = (64,64)
LATENT_DIM = 100
EPOCHS = 600
PATH = "../input/abstract-art-gallery/Abstract_gallery/Abstract_gallery"
#Importing data
batch_s = int(BATCH/2)
#Import as tf.Dataset
data = tf.keras.preprocessing.image_dataset_from_directory(PATH, label_mode = None, image_size = IMG_SIZE, batch_size = batch_s).map(lambda x: x /255.0)

Let us now build the generator and the discriminator. Note that the discriminator does include any pooling layers. According to this 2015 paper, stridden convolutions perform better than pooling layers.

generator=Sequential()
generator.add(Dense(4*4*512,input_shape=[LATENT_DIM]))
generator.add(Reshape([4,4,512]))
generator.add(Conv2DTranspose(256, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(64, kernel_size=4, strides=2, padding="same"))
generator.add(LeakyReLU(alpha=0.2))
generator.add(BatchNormalization())
generator.add(Conv2DTranspose(3, kernel_size=4, strides=2, padding="same",
 activation='sigmoid'))discriminator=Sequential()
discriminator.add(Conv2D(32, kernel_size=4, strides=2, padding="same",input_shape=[64,64, 3]))
discriminator.add(Conv2D(64, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(BatchNormalization())
discriminator.add(Conv2D(128, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(BatchNormalization())
discriminator.add(Conv2D(256, kernel_size=4, strides=2, padding="same"))
discriminator.add(LeakyReLU(0.2))
discriminator.add(Flatten())
discriminator.add(Dropout(0.5))
discriminator.add(Dense(1,activation='sigmoid'))

Define the training process

class GAN(tf.keras.Model):
 def __init__(self, discriminator, generator, latent_dim):
 super(GAN, self).__init__()
 self.discriminator = discriminator
 self.generator = generator
 self.latent_dim = latent_dim

 def compile(self, d_optimizer, g_optimizer, loss_fn):
 super(GAN, self).compile()
 self.d_optimizer = d_optimizer
 self.g_optimizer = g_optimizer
 self.loss_fn = loss_fn
 self.dloss = tf.keras.metrics.Mean(name="discriminator_loss")
 self.gloss = tf.keras.metrics.Mean(name="generator_loss")

 @property
 def metrics(self):
 return [self.dloss, self.gloss]

 def train_step(self, real_images):
 batch_size = tf.shape(real_images)[0]
 noise = tf.random.normal(shape=(batch_size, self.latent_dim))
 generated_images = self.generator(noise)
 combined_images = tf.concat([generated_images, real_images], axis=0)
 labels = tf.concat([tf.ones((batch_size, 1)), tf.zeros((batch_size, 1))], axis=0)
 labels += 0.05 * tf.random.uniform(tf.shape(labels))
 with tf.GradientTape() as tape:
 predictions = self.discriminator(combined_images)
 dloss = self.loss_fn(labels, predictions)grads = tape.gradient(dloss, self.discriminator.trainable_weights)
 self.d_optimizer.apply_gradients(zip(grads, self.discriminator.trainable_weights))

 noise = tf.random.normal(shape=(2*batch_size, self.latent_dim))
 labels = tf.zeros((2*batch_size, 1))
 with tf.GradientTape() as tape:
 predictions = self.discriminator(self.generator(noise))
 gloss = self.loss_fn(labels, predictions)
 grads = tape.gradient(gloss, self.generator.trainable_weights)
 self.g_optimizer.apply_gradients(zip(grads, self.generator.trainable_weights))
 self.dloss.update_state(dloss)
 self.gloss.update_state(gloss)
 return {"d_loss": self.dloss.result(), "g_loss": self.gloss.result()}

Now, let’s move to the evaluation. We’ll use k-means clustering, which might not make sense qualitatively since all our images are random paintings(single class). Nevertheless, I expect the k-means algorithm to identify subtle similarities among them and create appropriate clusters.

We have RGB images of shape (64,64). To reduce dimensions, we’ll average the arrays along the last axis to convert them to grayscale. Note that the actual formula for converting to grayscale is different. Refer to this link for more information. We can further shrink the dimensions using autoencoders/PCA, but I’ll refrain for now. Finally, for clustering, we’ll also flatten the images.

images = np.asarray(images)
images = np.mean(images,axis=3)
images = images.reshape((images.shape[0],-1))

To limit computation effort, I’ve only used the first 500 images to create clusters. The elbow score is not quite plateauing yet. It could be because of the subtle differences between images of the same class. Further reduction in image dimensionality might help create better clusters. For illustration, we’ll work with the kink at cluster 7.

elbow_scores=[]
for c in range(4,10):
 kmeans = KMeans(c)
 kmeans.fit(images[:500])
 elbow_scores.append(kmeans.inertia_)

plt.plot(range(4,10),elbow_scores)
plt.xlabel('Number of Clusters')
plt.title('Elbow Score')
plt.show()

Now, we’ll generate 500 images from the generator and see which clusters they fall into.

kmeans=KMeans(7)
train_classes=kmeans.fit_predict(images[:500])
arr = tf.random.normal(shape=(500,LATENT_DIM))
generated_portraits = generator(arr)
generated_portraits = np.array(generated_portraits).mean(axis=3).reshape((generated_portraits.shape[0],-1))
generated_classes = kmeans.predict(generated_portraits)

We have generated images from all but cluster 4. The GAN seems to have learned the distribution well and could improve with more training iterations/hyper-parameter tuning. Next, we expand this evaluation to create a more concrete statistical test (NDB score).

NDB Score

The ideal GAN must closely mimic the real data distribution. This is quantified using the NDB score. Here’s how it's computed:

Cluster the training data (t samples) into ’n’ bins (Like we have clustered the paintings into 7 bins)
Generate (g samples) of images
Predict the cluster(bin) of each generated image
For each bin, do the following test:

a. Compute the proportions of training and generated samples in the bin

b.Divide their difference by the standard error SE, which is calculated as shown below.

Here ‘p’ and ‘q’ are used to refer to training and generated data, and ‘P’ is the pooled sample proportion.

c. If the p-value corresponding to the z-score is less than a threshold, the bin is deemed statistically different

5. Divide the number of statistically different bins by the total number of bins. This yields a number b/w 0 and 1, quantifying the difference between the real and learned distributions.

6. If the above quantity is greater than a set threshold, the GAN is deemed to have encountered mode collapse.

def ndb_score(training_data_classes,generated_data_classes,num_classes,z_threshold):
 ndb = []
 NT = len(training_data_classes)
 NG = len(generated_data_classes)
 for i in range(num_classes):
 nt = np.sum(training_data_classes==i)
 pt = nt/len(training_data_classes) #training data proportion for bin
 ng = np.sum(generated_data_classes==i)
 pg = ng/len(generated_data_classes) #generated data proportion for bin
 P = (nt+ng)/(NT+NG)
 SE = (P*(1-P)*((1/NT)+(1/NG)))**0.5
 if abs((pt-pg)/SE) > z_threshold:
 ndb.append(i)
 print(f"Statisticall different classes:{ndb}")
 print(f"ndb score: {len(ndb)/num_classes}")

Our GAN has an NDB score of 0.25, and only two clusters-4,5- appear statistically different. So, we’ve successfully avoided the devious trench of mode collapse. This function can be made part of the GAN class and run at the end of each epoch as a validation scheme. You’ll find that code here.

Conclusion

Thanks for reading till the end. There are a lot of ideas around how to avoid mode collapse. Besides tuning hyper-parameters and trying different loss functions, one could adopt a different training strategy. This paper details a few mechanisms. I’ll try to cover them in some other post.

References:

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Evaluating Mode Collapse in GANs Using NDB Score

Author(s): Shashank Kumar

Mode Collapse

Setting up the GAN

GAN Art and NDB score

Explore and run machine learning code with Kaggle Notebooks U+007C Using data from multiple data sources

NDB Score

Conclusion

Towards AI Academy

We Build Enterprise-Grade AI. We'll Teach You to Master It Too.

Recent Posts

Full-Stack Data Scientists for the Agentic Coding World

Building Production-Grade AI Skills with Snowflake Cortex AI Function Studio

I Tried 10 AI Agent Frameworks in 2026 — Here’s the Honest Guide I Wish I Had Earlier

How One Spring Boot Optimization Saved Our Startup $30,000 a Year

Inside Palantir AIP: How the World’s Most Controversial AI Platform Actually Works

What Is a Reverse Proxy? (And Why Every Backend Developer Should Care)

What Claude Opus 4.8 Actually Changes If You’re Building Agents

QWEN 3.7 Max Worked For 35 Hrs Straight And The Results Were Mind-blowing

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Evaluating Mode Collapse in GANs Using NDB Score

Author(s): Shashank Kumar

Mode Collapse

Setting up the GAN

GAN Art and NDB score

Explore and run machine learning code with Kaggle Notebooks U+007C Using data from multiple data sources

NDB Score

Conclusion

Towards AI Academy

We Build Enterprise-Grade AI. We'll Teach You to Master It Too.

Related posts

Recent Posts

Comprehensive AI Engineering and AI for Work certifications

Company

CONTACT US

GDPR CCPA Statement