Explainable Defect Detection Using Convolutional Neural Networks: Case Study
Author(s): Olga Chernytska
Train an object detection model without any bounding box labels. This post shows the power of Explainable AI.
Despite being extremely accurate, neural networks are not widely used in domains where prediction explainability is a requirement, such as medicine, banking, and education.
In this tutorial, I'll show you how to overcome this explainability limitation for Convolutional Neural Networks by exploring, inspecting, processing, and visualizing the feature maps produced by deep network layers. We will go through the approach and discuss how to apply it to a real-world task: Defect Detection.
I've created a GitHub repository for this project, where you can find all the data preparation, model, training, and evaluation scripts.
Contents
- Task
- Training Pipeline
- Inference Pipeline
- Evaluation
- Conclusion
Task
You are given a 400-image dataset that contains images of good items (labeled as class "Good") and items with a defect (labeled as class "Anomaly"). The dataset is imbalanced, with more samples of good images than defective ones. The item in the image may be of literally any type and complexity: a bottle, cable, pill, tile, leather, zipper, etc. Below is an example of how the dataset may look.
Your task is to build a model that classifies images into the "Good" / "Anomaly" classes and returns a bounding box for the defect if the image is classified as "Anomaly". Even though this may look like a typical object detection task, there is an issue: we do not have labels for the bounding boxes.
Fortunately, this task is solvable.
Training Pipeline
Disclosure: I am not sharing my real commercial project, but showing how to explain classification model predictions in general, so this approach may be used in many domains and tasks, not only manufacturing but medicine as well. I should also say: do not expect high accuracy here, because it is my quick pet project. But you are free to use my results as a starting point for your own project, invest more time, and achieve the accuracy you need.
Data Preparation
For all my experiments I've used the MVTec Anomaly Detection Dataset (pay attention: it is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, which means it cannot be used for commercial purposes).
The dataset includes 15 subsets of different item types, such as Bottle, Cable, Pill, Leather, and Tile; each subset has 300–400 images in total, each labeled as "Good" / "Anomaly".
As a data preprocessing step, resize the images to 224×224 pixels to speed up training. Images in most subsets are 1024×1024, but because the defects are also large, we may resize to a lower resolution without sacrificing model accuracy.
Consider using data augmentations. In general, appropriate data augmentations are always beneficial for your model (by the way, check my post on data augmentation to learn more).
But let's assume that, when deployed to production, our model will "see" data of exactly the same format as in the dataset we have now. So, if images are centered, scaled, and rotated (as in the Capsule and Cable subsets), we may not use any data augmentations at all, because test images are expected to be centered, scaled, and rotated as well. However, if images are not rotated (only centered and scaled), as in the Screw and Metal Nut subsets, adding rotation to the training pipeline helps the model learn better. A minimal sketch of such a preprocessing pipeline is shown below.
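As a minimal sketch with torchvision (my assumed tooling; the actual repository may differ), the resizing step and the optional rotation augmentation could look like this:

```python
from torchvision import transforms

IMG_SIZE = 224  # downscale the 1024x1024 originals to speed up training

# Base preprocessing applied to every image (train and test)
base_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
])

# Train-time pipeline for subsets whose items are not rotation-aligned
# (e.g., Screw, Metal Nut): add random rotation as augmentation
train_transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
])
```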
Split the data into train/test parts. Ideally, we would like to have train, validation, and test parts: to train the model, tune the hyperparameters, and evaluate model accuracy, respectively. But we have only 300–400 images, so let's put 80% of the images into the train set and 20% into the test set. For small datasets, we may perform 5-fold cross-validation to make sure the evaluation results are robust.
When dealing with an imbalanced dataset, the train/test split should be performed in a stratified manner, so the train and test parts contain the same share of both classes, "Good" and "Anomaly". Additionally, if you have information on the defect types (such as scratch, crack, etc.), it's better to stratify on the defect types as well, so the train and test parts contain the same share of items with scratches and cracks. A sketch of such a stratified split is shown below.
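Here is a minimal sketch of the stratified split with scikit-learn; `image_paths` and `labels` are assumed to be collected beforehand while scanning the dataset directory:

```python
from sklearn.model_selection import train_test_split

# image_paths: list of file paths; labels: list of "Good" / "Anomaly" strings
train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths,
    labels,
    test_size=0.2,       # 80% train / 20% test
    stratify=labels,     # keep the Good/Anomaly ratio equal in both parts
    random_state=42,
)
```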
Model
Let's take VGG16 pre-trained on ImageNet and change its classification head: replace the Flatten and Dense layers with Global Average Pooling and a single Dense layer. I'll explain in the section "Inference Pipeline" why we need these particular layers. A minimal sketch of the modified architecture is shown below.
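Here is a minimal PyTorch sketch of this architecture; the class name and implementation details are my assumptions, not necessarily what the repository uses:

```python
import torch.nn as nn
from torchvision import models

class VGG16CAM(nn.Module):
    """VGG16 backbone with a Global Average Pooling + single Dense head."""

    def __init__(self, n_classes: int = 2):
        super().__init__()
        # On older torchvision, use models.vgg16(pretrained=True) instead.
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Keep the convolutional layers up to the ReLU after Conv5-3,
        # dropping the final MaxPool so the feature maps stay 14x14.
        self.features = nn.Sequential(*list(vgg.features.children())[:-1])
        self.gap = nn.AdaptiveAvgPool2d(1)    # Global Average Pooling
        self.fc = nn.Linear(512, n_classes)   # single Dense layer

    def forward(self, x):
        fmaps = self.features(x)              # (batch, 512, 14, 14)
        pooled = self.gap(fmaps).flatten(1)   # (batch, 512)
        return self.fc(pooled)                # (batch, 2) class scores
```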
(I found this approach in the paper Learning Deep Features for Discriminative Localization. In this post, I'll go through all the important steps described in the paper.)
We train the model as a typical 2-class classification model. The model outputs a 2-dimensional vector that contains the probabilities for the classes "Good" and "Anomaly" (the approach should also work with a 1-dimensional output, feel free to try).
During training, the first 10 convolutional layers are frozen; we train only the classification head and the last 3 convolutional layers. That is because our dataset is too small to fine-tune the whole model. The loss is Cross-Entropy; the optimizer is Adam with a learning rate of 0.0001.
I've experimented with different subsets of the MVTec Anomaly Detection Dataset. I trained the model with batch_size=10 for at most 10 epochs, with early stopping when the train set accuracy reaches 98%. To deal with the imbalanced dataset, we may apply loss weighting: use a higher weight for "Anomaly" class images and a lower one for "Good". A sketch of this training setup is shown below.
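A rough sketch of this training setup (layer freezing, loss weighting, optimizer), built on the model sketched above; the weight values and class index order are my assumptions based on the description:

```python
import torch
import torch.nn as nn

model = VGG16CAM(n_classes=2)  # model from the sketch above

# Freeze the first 10 convolutional layers; the last 3 conv layers
# and the classification head stay trainable.
conv_layers = [m for m in model.features if isinstance(m, nn.Conv2d)]
for conv in conv_layers[:10]:
    for p in conv.parameters():
        p.requires_grad = False

# Weighted Cross-Entropy: lower weight for "Good", higher for "Anomaly"
class_weights = torch.tensor([1.0, 3.0])  # assuming index 0 = "Good", 1 = "Anomaly"
criterion = nn.CrossEntropyLoss(weight=class_weights)

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```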
Inference Pipeline
During inference, we want not only to classify an image into the "Good" / "Anomaly" classes but also to get a bounding box for the defect if the image is classified as "Anomaly".
For this reason, in inference mode the model outputs the class probabilities as well as a heatmap, which will later be processed into the bounding box. The heatmap is created from the feature maps of the deep layers.
Step 1. Take all feature maps from the Conv5-3 layer, after the ReLU activation. For a single input, there will be 512 feature maps of size 14×14 (the 224×224 input image is downsampled by a factor of two in each of the 4 pooling layers).
Step 2. Sum up all 512 feature maps from the Conv5-3 layer, each multiplied by the weight in the Dense layer that affected the calculation of the "Anomaly" class score. Carefully look at Images 7 and 8 to understand this step.
Why so? Now you'll see why the classification head should have a Global Average Pooling layer and a Dense layer. Such an architecture makes it possible to trace which feature maps (and how much each of them) affected the final prediction and made it the "Anomaly" class.
Each feature map (the output of layer Conv5-3; see Image 6) highlights some regions in the input image. The Global Average Pooling layer represents each feature map as a single number (we may think of it as a 1-D embedding). The Dense layer calculates the scores (and probabilities) for the classes "Good" and "Anomaly" by multiplying each embedding by the corresponding weight. This flow is shown in Image 7.
So the Dense layer weights represent how much each feature map affects the scores for the "Good" and "Anomaly" classes (we are interested in the "Anomaly" class score only). And summing up the feature maps from layer Conv5-3, each multiplied by the corresponding weight from the Dense layer, makes a lot of sense.
Interestingly, using Global Average Pooling rather than Global Max Pooling is crucial for making the model find the whole object. Here is what the original paper, Learning Deep Features for Discriminative Localization, says:
"We believe that Global Average Pooling loss encourages the network to identify the extent of the object as compared to Global Max Pooling which encourages it to identify just one discriminative part. This is because, when doing the average of a map, the value can be maximized by finding all discriminative parts of an object as all low activations reduce the output of the particular map. On the other hand, for Global Max Pooling, low scores for all image regions except the most discriminative one do not impact the score as you just perform a max."
Step 3. The next step is to upsample the heatmap to match the input image size, 224×224. Bilinear upsampling is fine, as is any other upsampling method. A sketch of Steps 1 through 3 is shown below.
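Putting Steps 1 through 3 together, here is a minimal sketch of the heatmap computation for the model sketched earlier; the helper name and the assumption that class index 1 corresponds to "Anomaly" are mine:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_with_heatmap(model, image):
    """image: (1, 3, 224, 224) tensor -> class probabilities and a 224x224 heatmap."""
    # Step 1: feature maps from Conv5-3 after ReLU -> (1, 512, 14, 14)
    fmaps = model.features(image)

    # Classification branch: probabilities for "Good" / "Anomaly"
    pooled = model.gap(fmaps).flatten(1)            # (1, 512)
    probs = torch.softmax(model.fc(pooled), dim=1)  # (1, 2)

    # Step 2: sum the feature maps, each weighted by the Dense weight
    # that feeds the "Anomaly" class score
    anomaly_w = model.fc.weight[1]                            # (512,)
    heatmap = torch.einsum("c,bchw->bhw", anomaly_w, fmaps)   # (1, 14, 14)

    # Step 3: bilinear upsampling to the input resolution
    heatmap = F.interpolate(heatmap.unsqueeze(1), size=(224, 224),
                            mode="bilinear", align_corners=False).squeeze(1)
    return probs, heatmap
```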
Coming back to the model output: the model returns the probabilities for the classes "Good" and "Anomaly" and a heatmap that shows which pixels were important when calculating the "Anomaly" score. The model always returns the heatmap, no matter whether it classified the image as "Good" or "Anomaly"; when the class is "Good", we just ignore the heatmap.
The heatmaps look quite good (see Image 11) and explain which region made the model decide that the image belongs to the "Anomaly" class. We may stop here, or (as I promised) process the heatmap into a bounding box.
From heatmaps to bounding boxes. You may come up with several approaches here; I'll show you the simplest one. In most cases, it works pretty well.
1. First, normalize the heatmap, so all the values are in the range [0, 1].
2. Select a threshold. Apply it to the heatmap, so all values larger than the threshold are transformed into 1s and smaller ones into 0s. The larger the threshold, the smaller the bounding box will be. I like how the results look when the threshold is in the range [0.7, 0.9].
3. We assume that the region of 1s is a single dense region. Then plot a bounding box around it by finding the argmin and argmax along the height and width dimensions.
However, pay attention that this approach can only return a single bounding box (by definition), so it will fail if the image has multiple defective regions. A minimal sketch of the heatmap-to-box conversion is shown below.
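Here is a minimal sketch of this conversion; the function name is my own and the default threshold is simply one value from the range discussed above:

```python
import numpy as np

def heatmap_to_bbox(heatmap: np.ndarray, threshold: float = 0.8):
    """Convert an (H, W) heatmap into a single (x_min, y_min, x_max, y_max) box."""
    # 1. Normalize the heatmap to the [0, 1] range
    heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)

    # 2. Binarize: values above the threshold become 1, the rest 0
    mask = heatmap > threshold
    if not mask.any():
        return None  # nothing survived the threshold

    # 3. Assume a single dense region of 1s and take its extreme coordinates
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```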
Evaluation
Let's evaluate the approach on 5 subsets from the MVTec Anomaly Detection Dataset: Hazelnut, Leather, Cable, Toothbrush, and Pill.
For each subset, I trained a separate model; 20% of the images were selected as a test set, randomly and in a stratified manner. No data augmentations were used. I applied class weighting in the loss function: 1 for the "Good" class and 3 for "Anomaly", because in most subsets there are 3 times more good images than anomalous ones. The model was trained for at most 10 epochs, with early stopping if the train set accuracy reaches 98%. Here is my notebook with the training script.
Below are the evaluation results. The train set size for the subsets is 80–400 images. Balanced Accuracy (computed as sketched below) is between 81.7% and 95.5%. Some subsets, such as Hazelnut and Leather, are easier for the model to learn, while Pill is a relatively hard subset.
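Because the classes are imbalanced, plain accuracy would be misleading; here is a minimal sketch of the metric with scikit-learn (variable names assumed):

```python
from sklearn.metrics import balanced_accuracy_score

# y_true, y_pred: test-set labels and predictions, 0 = "Good", 1 = "Anomaly"
balanced_acc = balanced_accuracy_score(y_true, y_pred)
print(f"Balanced Accuracy: {balanced_acc:.3f}")
```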
That's it for the numbers; now let's see what the predictions look like. In most cases, the model produces a correct class prediction and a precise bounding box if the class is "Anomaly". However, there are some errors: either an incorrect class prediction, or a wrong bounding box location when the class is correctly predicted as "Anomaly".
Conclusion
In this post, I wanted to show you that neural networks are not the black-box algorithms some think they are, but can be quite explainable when you know where to look. The approach described here is one of many ways to explain your model's predictions.
Of course, the model is not that accurate, mostly because it is my quick pet project. But if you work on a similar task, feel free to take my results as a starting point, invest more time, and get the accuracy you need.
I am open-sourcing the code for this project in this GitHub repository. Feel free to use my results as a starting point for your own project.
What's next?
If you'd like to improve the accuracy of this Anomaly Detection model, adding data augmentation is the place to start. I recommend reading my post, Complete Guide to Data Augmentation for Computer Vision. There you'll find how to use data augmentations to benefit your model rather than harm it.
In case you are interested in case studies, check my tutorial, Gentle Introduction to 2D Hand Pose Estimation: Approach Explained.
And subscribe to my Twitter or Telegram so you don't miss my new posts.
References
[1] Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba: Learning Deep Features for Discriminative Localization; in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Paul Bergmann, Kilian Batzner, Michael Fauser, David Sattlegger, Carsten Steger: The MVTec Anomaly Detection Dataset: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection; in: International Journal of Computer Vision, January 2021.
[3] Paul Bergmann, Michael Fauser, David Sattlegger, Carsten Steger: MVTec AD: A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection; in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.