Matching Guided Distillation

Kaiyu Yue, Jiangfan Deng, Feng Zhou
Algorithm Research, Aibee Inc.
arXiv | Slides | Code / Models in PyTorch | Cat Paper Collection | Zhihu Blog


Update
  • Jul 28, 2021. We updated the experiments for unsupervised learning using MoCo-v2. Please check out the code and results.

  • Figure: T: teacher feature tensors. S: student feature tensors. d_p: distance function for distillation. C_i: the i-th channel.

    MGD
    MGD is a novel distillation method that works with intermediate features between teacher and student. MGD transforms the intermediate features through a matching guided mechanism, which removes the restriction on channel numbers without requiring adaptive modules.
    Given batches of data fed into the teacher and student, MGD calculates the pairwise similarity between the two feature sets at the distillation position. The motivation is that, whether the student has been pre-trained or not, each of its channels should be guided by its highly related teacher channels to directly narrow the semantic feature gap.
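    As an illustration, the sketch below computes such a channel-wise cost matrix. It is a minimal sketch under our own assumptions (features flattened over the batch and spatial dimensions, cosine similarity as the relatedness measure), not the released implementation; the paper's distance d_p may differ.

        import torch
        import torch.nn.functional as F

        def channel_cost(t_feat, s_feat):
            # t_feat: (N, Ct, H, W) teacher features at the distillation position.
            # s_feat: (N, Cs, H, W) student features, spatially aligned with the teacher.
            ct, cs = t_feat.shape[1], s_feat.shape[1]
            # Flatten every channel over the batch and spatial dims: (C, N*H*W).
            t = F.normalize(t_feat.transpose(0, 1).reshape(ct, -1), dim=1)
            s = F.normalize(s_feat.transpose(0, 1).reshape(cs, -1), dim=1)
            # (Ct, Cs) cost matrix: a smaller value means a more related channel pair.
            return 1.0 - t @ s.t()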
    So the next procedure helps each student channel find its highly related teacher channels. Commonly, the teacher is wider and larger than the student, so one student channel may be matched with multiple teacher channels. We treat this assignment problem as a Linear Assignment (LA) problem, which can be solved efficiently with the Hungarian algorithm.
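    One way to solve this matching is with SciPy's Hungarian solver, as in the sketch below. Tiling the student columns so that every teacher channel receives a partner is our illustrative assumption, not necessarily the paper's exact flow formulation.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        def match_channels(cost):
            # cost: (Ct, Cs) matrix from channel_cost above, with Ct >= Cs.
            ct, cs = cost.shape
            reps = int(np.ceil(ct / cs))
            # Replicate the student columns so the rectangular problem has a column
            # for every teacher row; one student channel can thereby absorb several
            # teacher channels.
            tiled = np.tile(cost, (1, reps))            # (Ct, Cs * reps)
            rows, cols = linear_sum_assignment(tiled)   # Hungarian algorithm
            match = np.empty(ct, dtype=np.int64)
            match[rows] = cols % cs                     # fold tiled columns back to student indices
            return match                                # match[i] = student channel for teacher channel i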
    After the matching step, we reduce the teacher channels that are matched with the same student channel in order to calculate the partial distance loss. We propose an efficient method for feature reduction and aggregation, Absolute Max Pooling (AMP), which takes the activation with the largest absolute value at each spatial position, preserving both positive (usable) and negative (adverse) information.
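    A minimal PyTorch sketch of AMP under the matching above (assuming every student channel receives at least one teacher channel) could look like this:

        import torch

        def absolute_max_pool(t_feat, match, cs):
            # t_feat: (N, Ct, H, W) teacher features.
            # match:  length-Ct LongTensor; match[i] is the student channel assigned
            #         to teacher channel i (see the matching sketch above).
            n, ct, h, w = t_feat.shape
            out = t_feat.new_zeros(n, cs, h, w)
            for j in range(cs):
                group = t_feat[:, match == j]                  # teacher channels matched to student channel j
                idx = group.abs().argmax(dim=1, keepdim=True)  # position of the largest |activation|
                out[:, j:j + 1] = group.gather(1, idx)         # keep the signed value, positive or negative
            return out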
    As shown in the figure at the top of this page, we optimize the whole system with a coordinate descent algorithm, alternating between solving the combinatorial matching/assignment problem and updating the student network weights. Assuming the matching is solved, we employ Stochastic Gradient Descent (SGD) to update the student network weights as usual. After several epochs or iterations of training with SGD, the student is switched into evaluation mode without learning. We then feed a dataset randomly sampled from the training data into the student and teacher in order to update the matching flow matrix.
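    The sketch below strings these pieces together into the alternating schedule. All the callables passed in are placeholders for this sketch (for example, update_matching could wrap the matching sketch above and distill_loss the AMP-reduced distance); it is not the released training code.

        import torch

        def train_with_mgd(teacher, student, train_loader, match_loader, optimizer,
                           task_loss, distill_loss, update_matching,
                           rounds=4, epochs_per_round=2):
            # student(x) is assumed to return (logits, intermediate features);
            # teacher(x) is assumed to return its intermediate features.
            match = None
            for _ in range(rounds):
                # Matching step: student frozen in eval mode, no gradients.
                student.eval()
                with torch.no_grad():
                    for images, _ in match_loader:   # small random subset of the training set
                        _, s_feat = student(images)
                        t_feat = teacher(images)
                        match = update_matching(t_feat, s_feat)
                        break                        # a single batch is enough for this sketch

                # SGD step: task loss + distillation loss under the fixed matching.
                student.train()
                for _ in range(epochs_per_round):
                    for images, labels in train_loader:
                        logits, s_feat = student(images)
                        with torch.no_grad():
                            t_feat = teacher(images)
                        loss = task_loss(logits, labels) + distill_loss(t_feat, s_feat, match)
                        optimizer.zero_grad()
                        loss.backward()
                        optimizer.step()
            return student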
    Video
    Long Intro Video. Slides
    Note: If you can't access YouTube, please watch this video on .
    Results
    We run a number of major experiments on large-scale classification with supervised and unsupervised learning, fine-grained recognition with transfer learning, and object detection and instance segmentation. The following tables summarize the main experimental results from the paper.

  • Unsupervised Learning with MoCo-v2 on ImageNet-1K
    model          method      MoCo epochs   top-1 acc.   top-5 acc.
    ResNet-50      Teacher     200           67.5         -
    ResNet-34      Student     200           57.2         81.5
    ResNet-34      MGD - AMP   200           58.5         82.7
    ResNet-18      Student     200           52.5         77.0
    ResNet-18      MGD - AMP   200           53.6         78.7

  • Transfer Learning on CUB-200
    model          method      top-1 err.   top-5 err.
    ResNet-50      Teacher     20.02        6.06
    MobileNet-V2   Student     24.61        7.56
    MobileNet-V2   MGD - AMP   20.47        5.23
    ShuffleNet-V2  Student     31.39        10.9
    ShuffleNet-V2  MGD - AMP   25.95        7.46

  • Large-Scale Classification on ImageNet-1K
    model          method      top-1 err.   top-5 err.
    ResNet-152     Teacher     21.69        5.94
    ResNet-50      Student     23.85        7.13
    ResNet-50      MGD - SM    22.02        5.68
    ResNet-50      Teacher     23.85        7.13
    MobileNet-V1   Student     31.13        11.24
    MobileNet-V1   MGD - AMP   28.53        9.65

  • Object Detection on COCO
    model            lr sched / scale    method      box-AP   box-AP50
    RetinaNet R-50   1x / multi-scale    Teacher     37.01    56.03
    RetinaNet R-18   1x / multi-scale    Student     30.78    47.88
    RetinaNet R-18   1x / multi-scale    MGD - AMP   31.38    48.79
    Paper
    Kaiyu Yue, Jiangfan Deng, Feng Zhou
    Matching Guided Distillation
    ECCV 2020 (Poster), Poster Number 4082
    arXiv | ECVA Version
    Code
    We implement MGD in PyTorch with support for two modes: DataParallel (DP) and DistributedDataParallel (DDP). MGD for object detection is also re-implemented in Detectron2 as an external project.

    Citation
    
            @inproceedings{eccv20mgd,
                  title     = {Matching Guided Distillation},
                  author    = {Yue, Kaiyu and Deng, Jiangfan and Zhou, Feng},
                  booktitle = {European Conference on Computer Vision (ECCV)},
                  year      = {2020}
            }
            
    Cat?
    Coincidentally, there is a cat living in the paper. Please check out the Cat Paper Collection maintained by Prof. Jun-Yan Zhu.
    Prince Michael @ Instagram
    Contact
    For questions and communications, feel free to drop an email to kaiyuyue [at] gmail.com.

    © 2020 Kaiyu Yue