
Faster Knowledge Distillation Using Uncertainty-Aware Mixup

Last Updated on November 10, 2024 by Editorial Team

Author(s): Tata Ganesh

Originally published on Towards AI.

Photo by Jaredd Craig on Unsplash

In this article, we will review the paper titled "Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup" [1], which aims to reduce the computational cost associated with distilling the knowledge of computer vision models.

Disclaimer: This paper's arXiv draft was published in 2020, so some of the teacher models mentioned in the results are small models by today's standards.

Knowledge Distillation

Knowledge distillation (KD) is the process of transferring learning from a larger model (called the teacher) to a smaller model (called the student). It is used to create compressed models that can run in resource-constrained environments. Further, KD typically yields a more accurate student than training the same small model from scratch. In the original knowledge distillation paper by Hinton et al. [2], the student model is trained using the output logits from the teacher model for each training sample. The ground-truth labels are also included during training if they are available. This process is illustrated below.

Knowledge Distillation framework. Figure by author. Dog image from CIFAR-10 dataset [3]
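To make this concrete, here is a minimal PyTorch sketch of the Hinton-style distillation objective described above: a KL-divergence term between temperature-softened teacher and student outputs plus a cross-entropy term on the ground-truth labels. The temperature T and weight alpha are illustrative hyperparameters, not values taken from either paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style KD loss [2]: soft targets from the teacher plus
    cross-entropy with the ground-truth labels (when available)."""
    # KL divergence between temperature-softened teacher and student distributions.
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy with the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```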

Computational Cost of Knowledge Distillation

First, let us define the different floating point operations that contribute to KD’s computational cost. Note that these operations are defined per image.
Fₜ = Teacher forward pass (to get output logits from teacher model)
Fₛ = Student forward pass (to get output logits from student model)
Bₛ = Student backward pass (to update weights of the student model)

The breakdown of the typical KD process for a mini-batch of N images is as follows:

  • A mini-batch of N images is passed through the teacher and the student models. The cost of this forward pass is Fₜ + Fₛ.
  • A distillation loss is applied between the teacher and the student models for different layers.
  • The student model's weights are updated during the backward pass. The cost of this backward pass is Bₛ.
  • Note: Since the teacher model is much larger than the student model, we can assume that Fₜ >> Fₛ, Fₜ >> Bₛ and Fₛ = Bₛ.

This process can be summarized using the following figure:

Framework of Knowledge Distillation [1]

Hence, the total cost of KD for a mini-batch of N images is N(Fₜ + Fₛ + Bₛ), which is dominated by the teacher term N·Fₜ:

Computational Cost of KD [1]

Reducing the number of images passed to the teacher model can therefore lead to an overall reduction in the computational cost of KD. So, how can we sample images from each mini-batch to reduce the cost of the teacher model's forward pass? Katharopoulos and Fleuret [4] observe that not all samples in a dataset are equally important for neural network training, and propose an importance sampling technique to focus computation on "informative" examples during training. Similarly, the importance or informativeness of examples in a mini-batch can be used to sample only informative examples and pass them to the teacher model. In the next section, we will discuss how the proposed method, named UNIX, performs this sampling.

UNcertainty-aware mIXup (UNIX)

UNIX Framework [1]

The sequence of steps for each mini-batch in UNIX is as follows:

Step 1: Student forward pass
Each mini-batch of images is fed to the student model to obtain the predicted class probabilities for each image.

Step 2: Uncertainty Estimation
For each image, the predicted probabilities are used to compute an uncertainty estimate, which loosely indicates the student model's prediction confidence for that image: the higher the uncertainty, the lower the confidence. Based on the Active Learning literature [5], uncertainty can be used to estimate the informativeness of each image. Specifically, the authors use the entropy of the student model's predicted probability distribution to quantify uncertainty.

Uncertainty quantification using entropy [1]
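As a minimal sketch, the entropy of each image's predicted distribution can be computed directly from the student's logits (assumed here to be a tensor of shape (N, num_classes)):

```python
import torch.nn.functional as F

def prediction_entropy(student_logits):
    """Entropy of the student's predicted class distribution, per image.
    Higher entropy means a more uncertain (less confident) prediction."""
    probs = F.softmax(student_logits, dim=1)          # (N, C)
    log_probs = F.log_softmax(student_logits, dim=1)  # numerically stable log
    return -(probs * log_probs).sum(dim=1)            # (N,)
```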

Step 3: Shuffling and Sorting the mini-batch
The mini-batch is then sorted in decreasing order of sample uncertainties. Let us name the sorted mini-batch B_sorted. Further, the original mini-batch is shuffled. Let us name the shuffled mini-batch B_shuffled.

Step 4: Uncertainty-Aware Mixup
Mixup [6] is a data augmentation technique that performs a convex combination of two images and their corresponding labels in a mini-batch. Mixup has been shown to improve the generalization of neural networks.

Mixup Data Augmentation [6]. λ is used to control the magnitude of mixup.
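For reference, vanilla mixup (independent of UNIX) takes only a few lines; the sketch below mixes both the images and their one-hot labels, with λ drawn from a Beta(α, α) distribution as in [6]. The function name and the value of α are illustrative.

```python
import torch
import torch.nn.functional as F

def mixup(images, labels, num_classes, alpha=0.2):
    """Vanilla mixup [6]: convex combination of random image pairs and their labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient λ
    perm = torch.randperm(images.size(0))                  # random pairing within the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
```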

The authors propose to use mixup as a way to compress information from two images into one, then feed the mixed image to the teacher and student models for KD. An element-wise mixup is performed between images in B_sorted and B_shuffled. Specifically,

Performing mixup based on sample uncertainty [1]

Here, c is a correction factor that is a function of each sample's uncertainty. c ensures that mixup is mild for uncertain samples (the mixed image stays close to the original uncertain image) and strong for confident samples. Note that labels are NOT mixed.

Step 5: Sampling and Teacher forward pass
After performing mixup, k images are sampled from the N mixed images. These k mixed images are fed as input to the teacher and student models for KD.
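Putting steps 1 through 5 together, the sketch below processes one mini-batch, reusing the prediction_entropy and kd_loss helpers from the earlier sketches. The exact form of the correction factor c and of the sampling rule are defined in the paper; here they are replaced by simple stand-ins (c grows linearly with normalized uncertainty so that uncertain samples are mixed only mildly, and the k most uncertain mixed images are kept), so treat this as an illustration of the control flow rather than a faithful reimplementation.

```python
import torch

def unix_minibatch(images, labels, student, teacher, k):
    """One UNIX mini-batch: uncertainty -> sort/shuffle -> mixup -> sample k -> KD."""
    # Steps 1-2: student forward pass and entropy-based uncertainty (no gradients needed).
    with torch.no_grad():
        uncertainty = prediction_entropy(student(images))

    # Step 3: sort by decreasing uncertainty; shuffle a second copy of the batch.
    order = torch.argsort(uncertainty, descending=True)
    x_sorted, y_sorted, u_sorted = images[order], labels[order], uncertainty[order]
    x_shuffled = images[torch.randperm(images.size(0))]

    # Step 4: uncertainty-aware mixup. Normalize uncertainty to [0, 1] and let the
    # correction factor c grow with it, so uncertain samples are mixed only mildly.
    # (Assumption: the paper defines c differently; this is a simple stand-in.)
    u_norm = (u_sorted - u_sorted.min()) / (u_sorted.max() - u_sorted.min() + 1e-8)
    c = (0.5 + 0.5 * u_norm).view(-1, 1, 1, 1)
    x_mixed = c * x_sorted + (1 - c) * x_shuffled
    # Labels are NOT mixed; assume each mixed image keeps its sorted image's label.

    # Step 5: keep k mixed images (here, the k most uncertain), then run the usual
    # teacher and student forward passes for distillation.
    x_k, y_k = x_mixed[:k], y_sorted[:k]
    with torch.no_grad():
        teacher_logits = teacher(x_k)
    student_logits = student(x_k)
    return kd_loss(student_logits, teacher_logits, y_k)
```

In training, the returned loss simply replaces the standard KD loss for the mini-batch, with the optimizer stepping on the student's parameters as usual.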

Comparing Computational Costs

Consider the case where the batch size N = 64 and k = 40. Then, the computational cost of a forward pass for a mini-batch with and without UNIX is shown below (note that the final cost is expressed with respect to the student model):

Example of Computation Cost of KD with and without UNIX. Figure by Author.

In our example, KD with UNIX yields a ~25% reduction in computational cost, improving the computational efficiency of the distillation process.
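To see where a number like this comes from, here is the back-of-the-envelope arithmetic, counting forward passes as in the comparison above and measuring everything in units of one student forward pass. The teacher-to-student cost ratio (Fₜ ≈ 7Fₛ) is a hypothetical value chosen for illustration; the actual saving depends on the architectures being compared.

```python
# Hypothetical forward-pass costs, in units of one student forward pass F_s.
N, k = 64, 40
F_s = 1.0
F_t = 7.0  # assumption: the teacher's forward pass costs ~7x the student's

cost_kd   = N * (F_t + F_s)            # plain KD: the full batch goes through both models
cost_unix = N * F_s + k * (F_t + F_s)  # UNIX: full-batch student pass for uncertainty,
                                       # then only k mixed images through both models
print(cost_kd, cost_unix)              # 512.0 384.0
print(1 - cost_unix / cost_kd)         # 0.25 -> ~25% cheaper
```

As the teacher grows relative to the student, the saving approaches (N − k)/N, i.e. 37.5% for N = 64 and k = 40.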

Results

CIFAR-100 Results
Results of different model architectures on the CIFAR-100 [3] image classification dataset are shown below.

KD results on CIFAR-100 [1]. WRN stands for Wide ResNet [7].

In most cases, the performance of UNIXKD is on par with that of original KD. Specifically, UNIXKD with k=36 provides a good tradeoff between accuracy and computational cost. Further, random sampling with KD (Random+KD) performs on par with or worse than UNIXKD for all model architectures, highlighting the importance of uncertainty-based sampling in improving computational efficiency with minimal reduction in accuracy.

ImageNet results
Results on the ImageNet [8] dataset are shown below.

KD results on ImageNet [1].

The columns with "+label" specify KD with ground-truth labels. For experiments with and without ground-truth labels, UNIXKD performs on par with original KD while reducing the total computational cost by ~23%.

Conclusion

Knowledge Distillation is a technique for transferring the knowledge of a large teacher model into a small student model. However, the high computational cost of performing a forward pass through the teacher model makes the distillation process expensive. To tackle this problem, UNcertainty-aware mIXup (UNIX) uses uncertainty-based sampling and the mixup augmentation technique to pass fewer images to the teacher model. Experiments on the CIFAR-100 and ImageNet datasets show that UNIX can reduce the computational cost of knowledge distillation by roughly 25% with minimal reduction in classification performance.

References

[1] G. Xu, Z. Liu, and C. C. Loy. Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup (2020), arXiv preprint arXiv:2012.09413.

[2] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network (2015), arXiv preprint arXiv:1503.02531.

[3] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images (2009).

[4] A. Katharopoulos and F. Fleuret. Not all samples are created equal: Deep learning with importance sampling (2018), International Conference on Machine Learning, PMLR.

[5] B. Settles. Active learning literature survey (2010), University of Wisconsin, Madison, 52(55–66):11.

[6] H. Zhang, M. Cisse, Y. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization (2018), 6th International Conference on Learning Representations.

[7] S. Zagoruyko and N. Komodakis. Wide Residual Networks (2017), arXiv preprint arXiv:1605.07146.

[8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database (2009), IEEE Conference on Computer Vision and Pattern Recognition.
