Matching Guided Distillation

Kaiyu Yue, Jiangfan Deng, Feng Zhou
Algorithm Research, Aibee Inc.

Figure: T denotes the teacher feature tensors, S the student feature tensors, dp the distance function for distillation, and Ci the i-th channel.

Feature distillation is an effective way to improve the performance of a smaller student model, which has fewer parameters and lower computational cost than the larger teacher model. Unfortunately, there is a common obstacle: the gap in semantic feature structure between the intermediate features of teacher and student.

The classic scheme transforms intermediate features by adding an adaptation module, such as a naive convolutional layer, an attention-based block, or a more complicated structure. However, this introduces two problems: a) the adaptation module brings more parameters into training; b) an adaptation module with random initialization or a special transformation is unfriendly for distilling a pre-trained student.

In this paper, we present Matching Guided Distillation (MGD) as an efficient and lightweight way to solve these problems. The key idea of MGD is to pose matching the teacher channels with the student's as an assignment problem. We compare three solutions to this assignment problem for reducing teacher channels under a partial distillation loss. The overall training takes a coordinate-descent approach that alternates between two optimization objectives: assignment updates and parameter updates. Since MGD only contains normalization and pooling operations with negligible computational cost, it can be flexibly plugged into a network together with other distillation methods, such as KD.
MGD is a novel distillation method that works on the intermediate features of teacher and student. It transforms the intermediate features through a matching guided mechanism, removing the restriction on channel numbers without any adaptation module.
Given batches of data fed into teacher and student, MGD calculates the pairwise similarity between the two feature sets at the distillation position. The motivation is that, whether the student is pre-trained or not, each of its channels should be guided by its most related teacher channels to directly narrow the semantic feature gap.
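As a concrete illustration, the pairwise channel similarity can be computed by flattening each channel over batch and spatial dimensions and comparing normalized channel vectors. This is a minimal sketch of the idea, not the paper's exact cost definition; the cosine-based cost is an assumption:

```python
import torch
import torch.nn.functional as F

def channel_cost_matrix(t_feat, s_feat):
    """Pairwise distance between teacher and student channels.

    A sketch of the matching cost (cosine distance is an assumption here):
    each channel is flattened over batch and spatial dims, L2-normalized,
    and compared by (1 - cosine similarity).
    t_feat: (N, Ct, H, W) teacher features; s_feat: (N, Cs, H, W) student.
    Returns a (Ct, Cs) cost matrix; small cost = highly related channels.
    """
    t = t_feat.permute(1, 0, 2, 3).reshape(t_feat.size(1), -1)  # (Ct, N*H*W)
    s = s_feat.permute(1, 0, 2, 3).reshape(s_feat.size(1), -1)  # (Cs, N*H*W)
    t = F.normalize(t, dim=1)
    s = F.normalize(s, dim=1)
    return 1.0 - t @ s.t()  # (Ct, Cs)
```

The resulting cost matrix is the input to the channel-assignment step described next.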
The next step is therefore to help each student channel find its most related teacher channels. Commonly, the teacher is wider and larger than the student, so one student channel may be matched with multiple teacher channels. We treat this as a Linear Assignment (LA) problem, which can be efficiently solved with the Hungarian algorithm.
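A minimal sketch of the many-to-one assignment, using SciPy's Hungarian solver: since the teacher has more channels than the student, the cost matrix is tiled along the student axis so every teacher channel receives exactly one student partner. The tiling trick is an illustrative assumption, not necessarily the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_channels(cost):
    """Assign every teacher channel to a student channel (many-to-one).

    cost: (Ct, Cs) array of matching costs, Ct >= Cs.
    The (Ct, Cs) matrix is tiled along the student axis until it has at
    least Ct columns, so the Hungarian algorithm yields a full one-to-one
    assignment; folding the tiled columns back gives the many-to-one map.
    Returns `assign` of length Ct with assign[i] = matched student channel.
    """
    ct, cs = cost.shape
    reps = -(-ct // cs)                   # ceil(ct / cs)
    tiled = np.tile(cost, (1, reps))      # (Ct, Cs*reps), Cs*reps >= Ct
    rows, cols = linear_sum_assignment(tiled)
    assign = np.empty(ct, dtype=np.int64)
    assign[rows] = cols % cs              # fold tiled columns back
    return assign
```

Each student channel then ends up guided by roughly Ct/Cs teacher channels.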
After the matching step, we reduce the teacher channels that are matched to the same student channel in order to compute the partial distance loss. We propose three efficient reduction methods: Sparse Matching (SM), Random Drop (RD) and Absolute Max Pooling (AMP). Among these three, AMP is the best choice for our method. It takes the activation with the maximum absolute value at each spatial position, preserving both positive (usable) and negative (adverse) information.
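The AMP reduction can be sketched as follows: for each student channel, gather its matched teacher channels and, at every spatial position, keep the signed activation with the largest absolute value. This is an illustrative implementation under the assumptions above, not the released code:

```python
import torch

def absolute_max_pool(t_feat, assign, cs):
    """Absolute Max Pooling over matched teacher channels (a sketch).

    t_feat: (N, Ct, H, W) teacher features.
    assign: length-Ct index array, assign[i] = matched student channel.
    cs:     number of student channels.
    At every spatial position, the matched group is reduced to the single
    activation with the largest magnitude, sign preserved, so both strong
    positive and strong negative evidence survive the reduction.
    Returns (N, Cs, H, W) reduced teacher features.
    """
    assign = torch.as_tensor(assign)
    n, ct, h, w = t_feat.shape
    out = t_feat.new_zeros(n, cs, h, w)
    for j in range(cs):
        group = t_feat[:, assign == j]            # (N, Gj, H, W)
        if group.size(1) == 0:
            continue
        idx = group.abs().argmax(dim=1, keepdim=True)
        out[:, j:j + 1] = group.gather(1, idx)    # signed abs-max
    return out
```

The reduced tensor has the student's channel count, so the distance loss can be computed directly, with no learned adaptation module.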
As shown in the figure at the top of this page, we optimize the whole system by coordinate descent, alternating between solving the combinatorial matching/assignment problem and updating the student network weights. With the matching fixed, we employ Stochastic Gradient Descent (SGD) to update the student weights as usual. After several epochs or iterations of SGD training, the student is switched into evaluation mode without learning. We then feed a dataset randomly sampled from the training data into the student and teacher in order to update the matching flow matrix.
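The alternation above can be sketched with a runnable toy: 1x1-conv stand-ins for teacher and student, a greedy nearest-channel rule standing in for the Hungarian solver, and a plain MSE standing in for the partial distillation loss. All sizes and the loss form are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Toy stand-ins: 1x1-conv "networks", teacher twice as wide as student.
teacher = nn.Conv2d(3, 8, 1).eval()
student = nn.Conv2d(3, 4, 1)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
x = torch.randn(32, 3, 4, 4)  # stands in for both training and matching data

def update_matching():
    # Matching step: student frozen in eval mode. A greedy nearest-channel
    # rule replaces the Hungarian solver to keep the sketch short.
    with torch.no_grad():
        t = F.normalize(teacher(x).permute(1, 0, 2, 3).reshape(8, -1), dim=1)
        s = F.normalize(student(x).permute(1, 0, 2, 3).reshape(4, -1), dim=1)
        return (1.0 - t @ s.t()).argmin(dim=1)  # teacher ch -> student ch

first_loss = None
for step in range(40):
    if step % 20 == 0:                  # periodically refresh the matching
        student.eval()
        assign = update_matching()
        student.train()
    # Parameter step: each teacher channel pulls its matched student
    # channel via MSE, a stand-in for the partial distillation loss.
    s_feat = student(x)
    loss = F.mse_loss(s_feat[:, assign], teacher(x).detach())
    if first_loss is None:
        first_loss = loss.item()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The matching refresh runs under `torch.no_grad()` because it only re-solves the assignment; gradients flow solely during the SGD phase.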
We run major experiments on large-scale classification, fine-grained recognition with transfer learning, detection and instance segmentation. The following tables summarize the main experimental results from the paper.

  • Transfer Learning on CUB-200

    model          method     top-1 err.  top-5 err.
    ResNet-50      Teacher    20.02       6.06
    MobileNet-V2   Student    24.61       7.56
    MobileNet-V2   MGD - AMP  20.47       5.23
    ShuffleNet-V2  Student    31.39       10.90
    ShuffleNet-V2  MGD - AMP  25.95       7.46

  • Large-Scale Classification on ImageNet-1K

    model          method     top-1 err.  top-5 err.
    ResNet-152     Teacher    21.69       5.94
    ResNet-50      Student    23.85       7.13
    ResNet-50      MGD - SM   22.02       5.68
    ResNet-50      Teacher    23.85       7.13
    MobileNet-V1   Student    31.13       11.24
    MobileNet-V1   MGD - AMP  28.53       9.65

  • Object Detection on COCO

    model           lr sched / scale   method     box-AP  box-AP50
    RetinaNet R-50  1x / multi-scale   Teacher    37.01   56.03
    RetinaNet R-18  1x / multi-scale   Student    30.78   47.88
    RetinaNet R-18  1x / multi-scale   MGD - AMP  31.38   48.79
ECCV 2020 (Poster), Poster Number 4082

We implement MGD in PyTorch, supporting two modes: DataParallel (DP) and DistributedDataParallel (DDP). MGD for object detection is also re-implemented in Detectron2 as an external project.

    @inproceedings{yue2020matching,
      title={Matching Guided Distillation},
      author={Yue, Kaiyu and Deng, Jiangfan and Zhou, Feng},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2020}
    }
    Coincidentally, there is a cat living in the paper. Please check out the Cat Paper Collection maintained by Prof. Jun-Yan Zhu.
    For questions and communication, feel free to send an email to kaiyuyue [at]

    © 2020 Kaiyu Yue