Paper Explained - Knowledge Distillation: A Good Teacher is Patient and Consistent

Nakshatra Singh
5 min read · Jun 18, 2021
Evaluating the fidelity of knowledge distillation. The effect of enlarging the CIFAR-100 distillation dataset with GAN-generated samples. (a): The student and teacher are both single ResNet-56 networks. Student fidelity increases as the dataset grows, but test accuracy decreases. (b): The student is a single ResNet-56 network and the teacher is a 3-component ensemble. Student fidelity again increases as the dataset grows, but test accuracy now slightly increases. The shaded region corresponds to µ ± σ, estimated over 3 trials. The image is taken from the “Does Knowledge Distillation Really Work?” paper [2].

Introduction & Overview

The motivating problem for this paper is model compression. The authors observed something telling: according to the download counts of five BiT models on TensorFlow Hub, the smallest ResNet-50 model has been downloaded significantly more often than the larger ones. As a result, many recent improvements in vision do not translate to real-world applications. Recent advances in neural vision architectures, and the benefits that come with scaling up model size, have pushed many vision tasks to state-of-the-art results, but these heavy models are not accessible to most people. We therefore need to compress them into models like ResNet-50 through knowledge distillation, so that more people can actually use these advances in their research and applications.

Another approach to compressing deep learning models is pruning, but the authors note some issues with it. First, it does not allow changing the model family, i.e., translating from a ResNet to a MobileNet with its depthwise convolutions is not possible. Second, there may be architecture-dependent challenges: for example, if the large model uses group normalisation, pruning channels may require dynamically re-balancing the channel groups. With these issues raised about model pruning, the paper focuses on knowledge distillation for compression, and this is where the main crux starts.

Consistency in Knowledge Distillation

One component of the knowledge distillation training recipe is consistency. We start off by training the student's predictions (or internal activations; this can be done at the softmax output layer, the logits, or internal feature maps) to match those of the teacher.
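To make the matching step concrete, here is a minimal sketch of a soft-target distillation loss, written in PyTorch purely for illustration (the paper does not prescribe a framework, and the temperature value and the `distillation_loss` name are my own assumptions):

```python
# Minimal sketch of a soft-target distillation loss (illustrative, not the
# paper's exact implementation). The student is trained to match the teacher's
# softened output distribution via a KL-divergence loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # Soften both distributions with the same temperature.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged over the batch. The T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Example: a batch of 8 samples over 100 classes.
student_logits = torch.randn(8, 100, requires_grad=True)
teacher_logits = torch.randn(8, 100)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```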

Schematic illustrations of various design choices when doing knowledge distillation. Left: Teacher receives a fixed image, while student receives a random augmentation. Center-left: Teacher and student receive independent image augmentations. Center-right: Teacher and student receive consistent image augmentations. Right: Teacher and student receive consistent image augmentations plus the input image manifold is extended by including linear segments between pairs of images (known as mixup [50] augmentation). The image is taken from page 2 of the paper.

This figure takes apart the different ways of structuring the student's and the teacher's input images for the sake of enforcing consistency in knowledge distillation. The first setting is known as fixed teacher: the teacher takes the entire, un-augmented image as input and computes its logits. One benefit of doing this is that you can precompute the teacher activations once by passing the whole dataset through the network. This is enticing because the training pipeline can then simply read the stored logits from disk and apply the loss function directly, without ever running the teacher again. The student, meanwhile, receives a random augmentation; it might see a crop, a rotated image, and so on, and its logits are matched against the precomputed teacher logits.
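A sketch of what that pre-computation might look like, reusing the `distillation_loss` helper above (the `augment` callable and the function names are hypothetical):

```python
# Sketch of the "fixed teacher" setting: teacher logits are computed once on the
# un-augmented images and cached, then reused at every training step while the
# student sees a fresh random augmentation of the same images.
import torch

@torch.no_grad()
def precompute_teacher_logits(teacher, dataloader):
    teacher.eval()
    # dataloader yields the raw, un-augmented images in a fixed order
    return torch.cat([teacher(images) for images, _ in dataloader])

def fixed_teacher_step(student, images, cached_teacher_logits, augment, optimizer):
    # Student and teacher effectively see different inputs: the student gets an
    # augmented view, while the cached logits came from the clean image.
    student_logits = student(augment(images))
    loss = distillation_loss(student_logits, cached_teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```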

In the independent noise setting, the student and the teacher each see their own stochastically augmented view of the image, and the logits are then compared. These first two settings are inconsistent compared to the center-right and right settings.

In consistent teaching, the student and the teacher are shown the same augmented image, and the logits are matched.

In function matching, the authors argue that knowledge distillation should not just be about matching predictions on the target data; you should also try to increase the support of the data distribution. What they use here is mixup augmentation: by interpolating between data points (or, alternatively, using out-of-domain data), the student is pushed to match the teacher's function across a wider region of input space, not just at the original samples.
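Here is a sketch of a single training step for the two consistent settings; the key difference from the fixed-teacher sketch above is that the teacher is run inside the step, on exactly the view the student sees (`augment` and `mixup` are hypothetical callables; a mixup sketch follows in the MixUp section below):

```python
# Sketch of consistent teaching / function matching: the SAME augmented (and,
# for function matching, mixed) view is fed to both the teacher and the student.
import torch

def distillation_step(student, teacher, images, augment, optimizer, mixup=None):
    view = augment(images)        # one random crop/flip per image
    if mixup is not None:         # function matching: extend the input support
        view = mixup(view)
    with torch.no_grad():
        teacher_logits = teacher(view)   # teacher sees the same view as the student
    student_logits = student(view)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```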

Perspective of Function Matching (FunMatch)

One interesting way to contrast this view of knowledge distillation with other label-augmentation strategies is Meta Pseudo Labels. In Meta Pseudo Labels, the teacher and the student are coupled in a feedback loop: the teacher produces pseudo labels (ŷ) for the student, and the teacher itself is updated based on the student's performance on labeled data.

The difference between Pseudo Labels and Meta Pseudo Labels. Left: Pseudo Labels, where a fixed pre-trained teacher generates pseudo labels for the student to learn from. Right: Meta Pseudo Labels, where the teacher is trained along with the student. The student is trained based on the pseudo labels generated by the teacher (top arrow). The teacher is trained based on the performance of the student on labeled data (bottom arrow). The image is taken from the Meta Pseudo Labels paper [3].

MixUp Augmentation

What mixup augmentation does is take two images and average their pixels together with a random weight, giving the kind of blurry overlap of images shown below on the Fashion-MNIST dataset. This strategy focuses on expanding the support of the data distribution. It is different from the usual augmentations, which mostly encode translation and rotation invariance; mixup instead encodes interpolation paths between points in the dataset, which makes it a different kind of data augmentation strategy.

Image from https://blog.airlab.re.kr/2019/11/mixup
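Below is a minimal mixup sketch, i.e. the `mixup` callable assumed in the step function earlier. With α = 1 the Beta distribution reduces to a uniform mixing weight; note that in function matching no label mixing is needed, since the teacher is simply queried on the mixed image:

```python
# Sketch of mixup on an image batch: a random convex combination of each image
# with another randomly chosen image from the same batch.
import torch

def mixup(images, alpha=1.0):
    # Mixing weight; Beta(1, 1) is uniform on [0, 1].
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))        # random pairing within the batch
    return lam * images + (1.0 - lam) * images[perm]

# Example: mix a batch of 8 RGB images of size 224x224.
batch = torch.rand(8, 3, 224, 224)
mixed = mixup(batch)
```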

Patience in Knowledge Distillation

Another component of the knowledge distillation training recipe is patience: knowledge distillation benefits from long training schedules.

The image is taken from page 5 of the paper.

The figure above shows how the student improves as the number of training steps/epochs grows when distilling the larger model into the ResNet-50 student. One needs patience along with consistency when doing distillation: given a long enough schedule, the teacher will eventually be matched, and this holds across datasets of different scales, as shown above.
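Putting patience together with the consistent step above, the training loop itself is deliberately unremarkable: just a very long run with the learning rate annealed over the entire schedule. Everything below (the optimizer choice, epoch count, `train_loader`, `augment`, `mixup`, and the `distillation_step` helper from the earlier sketches) is illustrative rather than the paper's exact recipe:

```python
# Sketch of the "patience" ingredient: one long run, with a cosine learning-rate
# decay stretched over the whole schedule rather than a short fine-tuning recipe.
import torch

epochs = 1200                    # distillation keeps improving with long schedules
steps_per_epoch = len(train_loader)
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)

for epoch in range(epochs):
    for images, _ in train_loader:      # ground-truth labels are never used
        distillation_step(student, teacher, images, augment, optimizer, mixup=mixup)
        scheduler.step()
```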

Left: Top-1 accuracy on ImageNet of three distillation setups: (1) fixed teacher; (2) consistent teaching; (3) function matching (“FunMatch”). Light color curves show accuracy throughout training, while the solid scatter plots are the final results. The student with a fixed teacher eventually saturates and overfits to it. Both consistent teaching and function matching do not exhibit overfitting or saturation. Middle: Reducing the optimization cost, via Shampoo preconditioning; with 1200 epochs, it is able to match the baseline trained for 4800 epochs. Right: Initializing student with pre-trained weights improves short training runs, but eventually harms for the longest schedules. The image is taken from page 6 of the paper.

In this plot we look further into the patience idea, comparing the fixed teacher, consistent teaching, and function matching setups over long training schedules.

Results

T stands for teacher and S stands for student; the numbers next to T and S indicate the crop size taken as input by the teacher and student, respectively. The image is taken from page 7 of the paper.

The high-level takeaway is that with this prescription of patience and consistency, and the function matching setup in particular, the authors were able to distil a large BiT teacher into a ResNet-50 achieving 82.8% ImageNet top-1 accuracy (+2.2% over the previous ResNet-50 SOTA).

If you enjoyed this article and gained insightful knowledge, consider buying me a coffee ☕️ by clicking here. 🤤

References

  1. Knowledge Distillation: A Good Teacher is Patient and Consistent.
  2. Does Knowledge Distillation Really Work?
  3. Meta Pseudo Labels

If you liked this post, please make sure to clap 👏. 💬 Connect? Let’s get social: http://myurls.co/nakshatrasinghh.
