Abstract
Knowledge distillation (KD) is widely applied in the training of efficient neural networks. A compact model trained to mimic the representation of a cumbersome model for the same task generally achieves better performance than one trained with ground-truth labels alone. Previous KD-based works mainly focus on two aspects: (1) designing various feature representations for knowledge transfer; (2) introducing different training mechanisms such as progressive learning or adversarial learning. In this paper, we revisit standard KD and observe that training with the teacher's logits may suffer from incorrect and uncertain supervision. To tackle these problems, we propose two novel approaches, Logits Adjustment (LA) and Dynamic Temperature Distillation (DTD), which deal with incorrect logits and uncertain logits respectively. Specifically, LA rectifies incorrect logits according to the ground-truth label and certain rules, while DTD treats the temperature of KD as a dynamic, sample-wise parameter rather than a static, global hyper-parameter, so that it reflects the uncertainty of each sample's logits. By iteratively updating the sample-wise temperature, the student model pays more attention to the samples that confuse the teacher model. Experiments on CIFAR-10/100, CINIC-10 and Tiny ImageNet verify that the proposed methods yield encouraging improvements over standard KD. Furthermore, owing to their simple implementations, LA and DTD can be easily attached to many KD-based frameworks and bring improvements without extra training time or computing resources.
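The abstract only outlines the two components, so the sketch below illustrates one plausible realization in PyTorch. The specific adjustment rule (swapping the wrongly-ranked logit with the true-class logit), the confidence-based temperature heuristic, and the hyper-parameters `t_base`, `bias`, and `alpha` are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of Logits Adjustment (LA) and Dynamic Temperature
# Distillation (DTD) as described in the abstract. The concrete rules
# below are assumptions for illustration only.
import torch
import torch.nn.functional as F


def adjust_logits(teacher_logits, labels):
    """LA sketch: if the teacher's top prediction disagrees with the
    ground-truth label, swap the two logit entries so the true class
    carries the largest logit (assumed rectification rule)."""
    adjusted = teacher_logits.clone()
    pred = adjusted.argmax(dim=1)
    idx = (pred != labels).nonzero(as_tuple=True)[0]
    true_vals = adjusted[idx, labels[idx]].clone()
    pred_vals = adjusted[idx, pred[idx]].clone()
    adjusted[idx, labels[idx]] = pred_vals
    adjusted[idx, pred[idx]] = true_vals
    return adjusted


def dynamic_temperature(teacher_logits, labels, t_base=4.0, bias=2.0):
    """DTD sketch: give each sample its own temperature from the teacher's
    confidence on the true class (assumed heuristic: lower confidence ->
    lower temperature, i.e. sharper targets on confusing samples)."""
    with torch.no_grad():
        probs = F.softmax(teacher_logits, dim=1)
        conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return t_base * conf + bias  # per-sample temperature, shape (batch,)


def kd_loss(student_logits, teacher_logits, labels, alpha=0.9):
    """KD loss using LA-rectified targets and sample-wise temperatures."""
    teacher_logits = adjust_logits(teacher_logits, labels)
    temp = dynamic_temperature(teacher_logits, labels).unsqueeze(1)
    soft_targets = F.softmax(teacher_logits / temp, dim=1)
    log_student = F.log_softmax(student_logits / temp, dim=1)
    # per-sample KL divergence, scaled by T^2 as in standard KD
    kl = (soft_targets * (soft_targets.log() - log_student)).sum(dim=1)
    kd = (kl * temp.squeeze(1) ** 2).mean()
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    student = torch.randn(8, 100)
    teacher = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))
    print(kd_loss(student, teacher, labels))
```

Because the loss only rewrites the soft targets and the temperature, a sketch like this can be dropped into an existing KD training loop in place of the standard distillation loss, which matches the abstract's claim that LA and DTD add no extra training cost.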
| Original language | English |
|---|---|
| Pages (from-to) | 25-33 |
| Number of pages | 9 |
| Journal | Neurocomputing |
| Volume | 454 |
| DOIs | |
| State | Published - 24 Sep 2021 |
Keywords
- Hard example mining
- Knowledge distillation
- Label regularization