Faster Knowledge Distillation Using Uncertainty-Aware Mixup
Author(s): Tata Ganesh
Originally published on Towards AI.
In this article, we will review the paper titled "Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup" [1], which aims to reduce the computational cost associated with distilling the knowledge of computer vision models.
Disclaimer: This paper's arXiv draft was published in 2020, so some of the teacher models mentioned in the results are small models by today's standards.
Knowledge Distillation
Knowledge distillation (KD) is the process of transferring learning from a larger model (called the teacher) to a smaller model (called the student). It is used to create compressed models that can run in resource-constrained environments. Further, KD typically yields a more accurate model than training the same small model from scratch. In the original knowledge distillation paper by Hinton et al. [2], the student model is trained using the output logits of the teacher model for each training sample. The ground-truth labels are also included during training if they are available. This process is illustrated below.
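As a quick refresher, here is a minimal PyTorch sketch of this logit-based distillation loss. The temperature T and weighting factor alpha are illustrative hyperparameters, not values from the UNIX paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD objective (Hinton et al. [2]): a weighted sum of the
    KL divergence between temperature-softened teacher and student
    distributions and the usual cross-entropy with ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales the soft-target gradients so both terms stay comparable.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard
```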
Computational Cost of Knowledge Distillation
First, let us define the different floating-point operations that contribute to KD's computational cost. Note that these operations are defined per image.
Fₜ = Teacher forward pass (to get output logits from the teacher model)
Fₛ = Student forward pass (to get output logits from the student model)
Bₛ = Student backward pass (to update the weights of the student model)
The breakdown of the typical KD process for a mini-batch of N images is as follows:
- A mini-batch of N images is passed through the teacher and the student models. The cost of this forward pass is Fₜ + Fₛ per image.
- A distillation loss is applied between the teacher and the student models for different layers.
- The student model's weights are updated during the backward pass. The cost of this backward pass is Bₛ per image.
- Note: Since the teacher model is much larger than the student model, we can assume that Fₜ >> Fₛ, Fₜ >> Bₛ, and Fₛ = Bₛ.
This process can be summarized using the following figure:
Hence, the total cost of KD for a mini-batch of N images is:
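Written out using the per-image costs above, this comes to roughly N × (Fₜ + Fₛ + Bₛ) = N × Fₜ + 2N × Fₛ (using Fₛ = Bₛ). Since Fₜ >> Fₛ, the teacher's forward pass dominates the total.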
Reducing the number of images passed to the teacher model can therefore lead to an overall reduction in the computational cost of KD. So, how can we sample images from each mini-batch to reduce the cost of the teacher model's forward pass? Katharopoulos et al. [4] observe that not all samples in a dataset are equally important for neural network training, and propose an importance sampling technique to focus computation on "informative" examples. Similarly, the informativeness of the examples in a mini-batch can be used to select only the informative ones and pass them to the teacher model. In the next section, we will discuss how the proposed method, named UNIX, performs this sampling.
UNcertainty-aware mIXup (UNIX)
The sequence of steps for each mini-batch in UNIX is as follows:
Step 1: Student Forward Pass
Each mini-batch of images is fed to the student model to obtain the predicted class probabilities for each image.
Step 2: Uncertainty Estimation
For each image, the predicted probabilities are used to compute an uncertainty estimate. The uncertainty value loosely indicates the student model's prediction confidence for that image: the higher the uncertainty, the lower the confidence. Based on the active learning literature [5], uncertainty can be used to estimate the informativeness of each image. In particular, the authors use the entropy of the student model's predicted probability distribution to quantify uncertainty.
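As a small sketch of this step (assuming entropy as the uncertainty measure), the per-image uncertainty can be computed from the student's logits like so:

```python
import torch
import torch.nn.functional as F

def entropy_uncertainty(student_logits):
    """Per-image predictive entropy of the student's softmax distribution.
    Higher entropy -> lower confidence -> a more 'informative' image."""
    probs = F.softmax(student_logits, dim=1)          # (N, num_classes)
    log_probs = F.log_softmax(student_logits, dim=1)  # numerically stable log
    return -(probs * log_probs).sum(dim=1)            # (N,) entropy per image
```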
Step 3: Sorting and Shuffling the Mini-Batch
The mini-batch is then sorted in decreasing order of sample uncertainties. Let us name the sorted mini-batch Bsorted. Further, the original mini-batch is shuffled. Let us name the shuffled mini-batch Bshuffled.
Step 4: Uncertainty-Aware Mixup
Mixup [6] is a data augmentation technique that performs a convex combination of two images and their corresponding labels in a mini-batch. Mixup has been shown to improve the generalization of neural networks.
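Concretely, standard mixup combines two examples (xᵢ, yᵢ) and (xⱼ, yⱼ) into x̃ = λxᵢ + (1 − λ)xⱼ and ỹ = λyᵢ + (1 − λ)yⱼ, where λ is drawn from a Beta(α, α) distribution [6].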
The authors propose to use mixup as a way to compress information from two images into one, then feed the mixed image to the teacher and student models for KD. An element-wise mixup is performed between images in Bsorted and Bshuffled. Specifically,
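for the i-th pair, the mixed image takes a form along the lines of mixedᵢ = cᵢ · Bsortedᵢ + (1 − cᵢ) · Bshuffledᵢ, with cᵢ ∈ [0, 1] (see [1] for the exact expression).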
Here, c is a correction factor, which is a function of each sample's uncertainty. c ensures that mixup is mild for uncertain samples and strong for confident samples. Note that labels are NOT mixed.
Step 5: Sampling and Teacher Forward Pass
After performing mixup, k images are sampled from the N mixed images. These k mixed images are fed as input to the teacher and student models for KD.
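Putting steps 1–5 together, here is a rough PyTorch sketch of a single UNIX mini-batch. The exact correction factor (normalized entropy below) and the rule for choosing which k mixed images to keep are simplifying assumptions here, not the paper's precise formulation; see [1] for the details:

```python
import torch
import torch.nn.functional as F

def unix_distillation_step(images, teacher, student, k, T=4.0):
    """One UNIX mini-batch: uncertainty estimation, uncertainty-aware mixup,
    sub-sampling, and a distillation loss on the k retained mixed images."""
    N = images.size(0)

    # Steps 1-2: student forward pass and entropy-based uncertainty.
    with torch.no_grad():
        probs = F.softmax(student(images), dim=1)
        uncertainty = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

    # Step 3: sort by decreasing uncertainty and shuffle the original batch.
    order = torch.argsort(uncertainty, descending=True)
    x_sorted = images[order]
    x_shuffled = images[torch.randperm(N, device=images.device)]

    # Step 4: uncertainty-aware mixup. c is close to 1 for uncertain samples
    # (mild mixup) and smaller for confident samples (strong mixup).
    # Labels are NOT mixed. Assumes images of shape (N, C, H, W).
    c = (uncertainty[order] / uncertainty.max().clamp_min(1e-12)).view(-1, 1, 1, 1)
    mixed = c * x_sorted + (1.0 - c) * x_shuffled

    # Step 5: keep k mixed images (here, those built from the most uncertain
    # originals) and run the usual teacher/student forward passes for KD.
    mixed_k = mixed[:k]
    with torch.no_grad():
        teacher_logits = teacher(mixed_k)
    student_logits = student(mixed_k)
    loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return loss
```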
Comparing Computational Costs
Consider the case where batch size N = 64 and k = 40. The computational cost of the forward passes for one mini-batch, with and without UNIX, is then as follows (note that the final cost is expressed with respect to the student model's forward-pass cost):
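As a rough back-of-the-envelope version of this comparison (counting forward passes only, and assuming purely for illustration that the teacher's forward pass costs about 7× the student's, i.e., Fₜ ≈ 7 × Fₛ): without UNIX, one mini-batch costs 64 × (Fₜ + Fₛ) ≈ 512 × Fₛ, while with UNIX it costs 64 × Fₛ for the uncertainty pass plus 40 × (Fₜ + Fₛ), i.e., about 384 × Fₛ, a reduction of (512 − 384) / 512 ≈ 25%.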
In our example, KD with UNIX thus yields a ~25% reduction in the computational cost of the distillation process.
Results
CIFAR-100 Results
Results of different model architectures on the CIFAR-100 [3] image classification dataset are shown below.
In most cases, the performance of UNIXKD is on par with that of the original KD. Specifically, UNIXKD with k=36 provides a good tradeoff between accuracy and computational cost. Further, random sampling with KD (Random+KD) performs on par with or worse than UNIXKD for all model architectures, highlighting the importance of uncertainty-based sampling for improving computational efficiency with minimal reduction in accuracy.
ImageNet Results
Results on the ImageNet [8] dataset are shown below.
The columns marked with "+label" denote KD with ground-truth labels. Both with and without ground-truth labels, UNIXKD performs on par with the original KD while reducing the total computational cost by ~23%.
Conclusion
Knowledge distillation is a technique for transferring the knowledge of a large teacher model into a small student model. However, the high computational cost of performing a forward pass through the teacher model makes the distillation process expensive. To tackle this problem, UNcertainty-aware mIXup (UNIX) uses uncertainty-based sampling and the mixup augmentation technique to pass a smaller number of images to the teacher model. Experiments on the CIFAR-100 and ImageNet datasets show that UNIX can reduce the computational cost of knowledge distillation by roughly 25% with minimal reduction in classification performance.
References
[1] G. Xu, Z. Liu, and C. C. Loy. Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup (2020), arXiv preprint arXiv:2012.09413.
[2] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network (2015), arXiv preprint arXiv:1503.02531.
[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images (2009).
[4] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling (2018), International Conference on Machine Learning, PMLR.
[5] B. Settles. Active learning literature survey (2010), University of Wisconsin, Madison, 52(55–66):11.
[6] H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization (2018), 6th International Conference on Learning Representations.
[7] S. Zagoruyko and N. Komodakis. Wide Residual Networks (2017), arXiv preprint arXiv:1605.07146.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database (2009), IEEE Conference on Computer Vision and Pattern Recognition.