Faster Knowledge Distillation Using Uncertainty-Aware Mixup
Author(s): Tata Ganesh
Originally published on Towards AI.
In this article, we will review the paper titled "Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup" [1], which aims to reduce the computational cost associated with distilling the knowledge of computer vision models.
Disclaimer: This paper's arXiv draft was published in 2020, so some of the teacher models mentioned in the results are small models by today's standards.
Knowledge Distillation
Knowledge distillation (KD) is the process of transferring learning from a larger model (called the teacher) to a smaller model (called the student). It is used to create compressed models that can run in resource-constrained environments. Further, KD typically yields a more accurate model than training the same small model from scratch. In the original knowledge distillation paper by Hinton et al. [2], the student model is trained using the output logits of the teacher model for each training sample. The ground-truth labels are also included during training if they are available. This process is illustrated below.
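As a quick refresher, here is a minimal PyTorch sketch of this logit-based distillation loss. The temperature T and weighting factor alpha are illustrative hyperparameters, not values from the UNIX paper:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard KD objective (Hinton et al. [2]): a weighted sum of the
    KL divergence between temperature-softened teacher and student
    distributions and the usual cross-entropy with ground-truth labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales the soft-target gradients so both terms stay comparable.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard
```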
Computational Cost of Knowledge Distillation
First, let us define the different floating-point operations that contribute to KD's computational cost. Note that these operations are defined per image.
Fₜ = Teacher forward pass (to get output logits from the teacher model)
Fₛ = Student forward pass (to get output logits from the student model)
Bₛ = Student backward pass (to update the weights of the student model)
The breakdown of the typical KD process for a mini-batch of N images is as follows:
- A mini-batch of N images is passed through the teacher and the student models. The cost of this forward pass is Fₜ + Fₛ per image.
- A distillation loss is applied between the teacher and the student models for different layers.
- The student model's weights are updated during the backward pass. The cost of this backward pass is Bₛ per image.
- Note: Since the teacher model is much larger than the student model, we can assume that Fₜ >> Fₛ, Fₜ >> Bₛ, and Fₛ = Bₛ.
This process can be summarized using the following figure:
Hence, the total cost of KD for a mini-batch of N images is:
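Written out using the per-image costs above, this comes to roughly N × (Fₜ + Fₛ + Bₛ) = N × Fₜ + 2N × Fₛ (using Fₛ = Bₛ). Since Fₜ >> Fₛ, the teacher's forward pass dominates the total.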
Reducing the number of images passed to the teacher model can therefore lead to an overall reduction in the computational cost of KD. So, how can we sample images from each mini-batch to reduce the cost of the teacher model's forward pass? Katharopoulos et al. [4] observe that not all samples in a dataset are equally important for neural network training, and propose an importance sampling technique to focus computation on "informative" examples. Similarly, the informativeness of the examples in a mini-batch can be used to select only the informative ones and pass them to the teacher model. In the next section, we will discuss how the proposed method, named UNIX, performs this sampling.
UNcertainty-aware mIXup (UNIX)
The sequence of steps for each mini-batch in UNIX is as follows:
Step 1: Student Forward Pass
Each mini-batch of images is fed to the student model to obtain the predicted class probabilities for each image.
Step 2: Uncertainty Estimation
For each image, the predicted probabilities are used to compute an uncertainty estimate. The uncertainty value loosely indicates the student model's prediction confidence for that image: the higher the uncertainty, the lower the confidence. Based on the active learning literature [5], uncertainty can be used to estimate the informativeness of each image. In particular, the authors use the entropy of the student model's predicted probability distribution to quantify uncertainty.
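As a small sketch of this step (assuming entropy as the uncertainty measure), the per-image uncertainty can be computed from the student's logits like so:

```python
import torch
import torch.nn.functional as F

def entropy_uncertainty(student_logits):
    """Per-image predictive entropy of the student's softmax distribution.
    Higher entropy -> lower confidence -> a more 'informative' image."""
    probs = F.softmax(student_logits, dim=1)          # (N, num_classes)
    log_probs = F.log_softmax(student_logits, dim=1)  # numerically stable log
    return -(probs * log_probs).sum(dim=1)            # (N,) entropy per image
```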
Step 3: Sorting and Shuffling the Mini-Batch
The mini-batch is then sorted in decreasing order of sample uncertainties. Let us name the sorted mini-batch Bsorted. Further, the original mini-batch is shuffled. Let us name the shuffled mini-batch Bshuffled.
Step 4: Uncertainty-Aware Mixup
Mixup [6] is a data augmentation technique that performs a convex combination of two images and their corresponding labels in a mini-batch. Mixup has been shown to improve the generalization of neural networks.
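Concretely, standard mixup combines two examples (xᵢ, yᵢ) and (xⱼ, yⱼ) into x̃ = λxᵢ + (1 − λ)xⱼ and ỹ = λyᵢ + (1 − λ)yⱼ, where λ is drawn from a Beta(α, α) distribution [6].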
The authors propose to use mixup as a way to compress information from two images into one, then feed the mixed image to the teacher and student models for KD. An element-wise mixup is performed between images in Bsorted and Bshuffled. Specifically,
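for the i-th pair, the mixed image takes a form along the lines of mixedᵢ = cᵢ · Bsortedᵢ + (1 − cᵢ) · Bshuffledᵢ, with cᵢ ∈ [0, 1] (see [1] for the exact expression).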
Here, c is a correction factor, which is a function of each sample's uncertainty. c ensures that mixup is mild for uncertain samples and strong for confident samples. Note that labels are NOT mixed.
Step 5: Sampling and Teacher Forward Pass
After performing mixup, k images are sampled from the N mixed images. These k mixed images are fed as input to the teacher and student models for KD.
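Putting steps 1–5 together, here is a rough PyTorch sketch of a single UNIX mini-batch. The exact correction factor (normalized entropy below) and the rule for choosing which k mixed images to keep are simplifying assumptions here, not the paper's precise formulation; see [1] for the details:

```python
import torch
import torch.nn.functional as F

def unix_distillation_step(images, teacher, student, k, T=4.0):
    """One UNIX mini-batch: uncertainty estimation, uncertainty-aware mixup,
    sub-sampling, and a distillation loss on the k retained mixed images."""
    N = images.size(0)

    # Steps 1-2: student forward pass and entropy-based uncertainty.
    with torch.no_grad():
        probs = F.softmax(student(images), dim=1)
        uncertainty = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

    # Step 3: sort by decreasing uncertainty and shuffle the original batch.
    order = torch.argsort(uncertainty, descending=True)
    x_sorted = images[order]
    x_shuffled = images[torch.randperm(N, device=images.device)]

    # Step 4: uncertainty-aware mixup. c is close to 1 for uncertain samples
    # (mild mixup) and smaller for confident samples (strong mixup).
    # Labels are NOT mixed. Assumes images of shape (N, C, H, W).
    c = (uncertainty[order] / uncertainty.max().clamp_min(1e-12)).view(-1, 1, 1, 1)
    mixed = c * x_sorted + (1.0 - c) * x_shuffled

    # Step 5: keep k mixed images (here, those built from the most uncertain
    # originals) and run the usual teacher/student forward passes for KD.
    mixed_k = mixed[:k]
    with torch.no_grad():
        teacher_logits = teacher(mixed_k)
    student_logits = student(mixed_k)
    loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return loss
```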
Comparing Computational Costs
Consider the case where batch size N = 64 and k = 40. The computational cost of the forward passes for one mini-batch, with and without UNIX, is then as follows (note that the final cost is expressed with respect to the student model's forward-pass cost):
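As a rough back-of-the-envelope version of this comparison (counting forward passes only, and assuming purely for illustration that the teacher's forward pass costs about 7× the student's, i.e., Fₜ ≈ 7 × Fₛ): without UNIX, one mini-batch costs 64 × (Fₜ + Fₛ) ≈ 512 × Fₛ, while with UNIX it costs 64 × Fₛ for the uncertainty pass plus 40 × (Fₜ + Fₛ), i.e., about 384 × Fₛ, a reduction of (512 − 384) / 512 ≈ 25%.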
In our example, KD with UNIX thus yields a ~25% reduction in the computational cost of the distillation process.
Results
CIFAR-100 Results
Results of different model architectures on the CIFAR-100 [3] image classification dataset are shown below.
In most cases, the performance of UNIXKD is on par with that of the original KD. Specifically, UNIXKD with k=36 provides a good tradeoff between accuracy and computational cost. Further, random sampling with KD (Random+KD) performs on par with or worse than UNIXKD for all model architectures, highlighting the importance of uncertainty-based sampling for improving computational efficiency with minimal reduction in accuracy.
ImageNet Results
Results on the ImageNet [8] dataset are shown below.
The columns marked with "+label" denote KD with ground-truth labels. Both with and without ground-truth labels, UNIXKD performs on par with the original KD while reducing the total computational cost by ~23%.
Conclusion
Knowledge distillation is a technique for transferring the knowledge of a large teacher model into a small student model. However, the high computational cost of performing a forward pass through the teacher model makes the distillation process expensive. To tackle this problem, UNcertainty-aware mIXup (UNIX) uses uncertainty-based sampling and the mixup augmentation technique to pass a smaller number of images to the teacher model. Experiments on the CIFAR-100 and ImageNet datasets show that UNIX can reduce the computational cost of knowledge distillation by roughly 25% with minimal reduction in classification performance.
References
[1] G. Xu, Z. Liu, and C. C. Loy. Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup (2020), arXiv preprint arXiv:2012.09413.
[2] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network (2015), arXiv preprint arXiv:1503.02531.
[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images (2009).
[4] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling (2018), International Conference on Machine Learning, PMLR.
[5] B. Settles. Active learning literature survey (2010), University of Wisconsin, Madison, 52(55–66):11.
[6] H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization (2018), 6th International Conference on Learning Representations.
[7] S. Zagoruyko and N. Komodakis. Wide Residual Networks (2017), arXiv preprint arXiv:1605.07146.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database (2009), IEEE Conference on Computer Vision and Pattern Recognition.