Preparing lessons: Improve knowledge distillation with better supervision

Research output: Contribution to journal › Article › peer-review

51 Scopus citations

Abstract

Knowledge distillation (KD) is widely applied in the training of efficient neural networks. A compact model trained to mimic the representation of a cumbersome model for the same task generally performs better than one trained with only the ground-truth labels. Previous KD-based works mainly focus on two aspects: (1) designing various feature representations for knowledge transfer; (2) introducing different training mechanisms such as progressive learning or adversarial learning. In this paper, we revisit standard KD and observe that training with the teacher's logits may suffer from incorrect and uncertain supervision. To tackle these problems, we propose two novel approaches, Logits Adjustment (LA) and Dynamic Temperature Distillation (DTD), which deal with incorrect logits and uncertain logits respectively. Specifically, LA rectifies incorrect logits according to the ground-truth label and certain rules, while DTD treats the temperature of KD as a dynamic, sample-wise parameter rather than a static, global hyper-parameter, which in effect encodes the uncertainty of each sample's logits. By iteratively updating the sample-wise temperature, the student model can pay more attention to the samples that confuse the teacher model. Experiments on CIFAR-10/100, CINIC-10 and Tiny ImageNet verify that the proposed methods yield encouraging improvements over standard KD. Furthermore, given their simple implementations, LA and DTD can be easily attached to many KD-based frameworks and bring improvements without extra training time or computing resources.
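The two ideas in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the swap-based rectification rule in `adjust_logits` and the entropy-based temperature in `dynamic_temperature` are illustrative assumptions standing in for the paper's actual rules.

```python
import numpy as np

def softmax(z, t):
    # Temperature-scaled softmax; t has shape [N, 1] and broadcasts per sample.
    z = z / t
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def adjust_logits(teacher_logits, labels):
    """Logits Adjustment sketch (hypothetical rule): when the teacher's
    argmax disagrees with the ground truth, swap the two entries so the
    ground-truth class holds the largest logit."""
    adjusted = teacher_logits.copy()
    for i, y in enumerate(labels):
        p = adjusted[i].argmax()
        if p != y:
            adjusted[i, p], adjusted[i, y] = adjusted[i, y], adjusted[i, p]
    return adjusted

def dynamic_temperature(teacher_logits, base_t=4.0, bias=2.0):
    """Dynamic Temperature Distillation sketch (hypothetical formula):
    samples whose teacher distribution has high entropy (an uncertain
    teacher) get a lower temperature, sharpening their soft targets so
    the student pays more attention to them."""
    p = softmax(teacher_logits, np.ones((len(teacher_logits), 1)))
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1, keepdims=True)
    max_h = np.log(teacher_logits.shape[1])  # entropy of the uniform distribution
    return base_t - bias * (entropy / max_h)  # shape [N, 1], one temperature per sample

def kd_loss(student_logits, teacher_logits, labels):
    # Distillation loss with rectified targets and sample-wise temperatures:
    # per-sample KL(teacher || student), averaged over the batch.
    t = dynamic_temperature(teacher_logits)
    q_t = softmax(adjust_logits(teacher_logits, labels), t)
    q_s = softmax(student_logits, t)
    return float((q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12))).sum(axis=1).mean())
```

Because both pieces only transform the teacher's logits before the usual softened-softmax loss, they can be dropped in front of most KD pipelines without changing the student's forward or backward pass.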

Original language: English
Pages (from-to): 25-33
Number of pages: 9
Journal: Neurocomputing
Volume: 454
DOIs
State: Published - 24 Sep 2021

Keywords

  • Hard example mining
  • Knowledge distillation
  • Label regularization
