%0 Journal Article
%T Moonshine: Distilling with Cheap Convolutions
%A Crowley, Elliot J.
%A Gray, Gavin
%A Storkey, Amos
%D 2017-11-07
%J arXiv:1711.02613 [cs, stat]
%U http://arxiv.org/abs/1711.02613
%K Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning
Statistics - Machine Learning
%X Model distillation compresses a trained machine learning model, such as a neural network, into a smaller alternative so that it can be easily deployed in a resource-limited setting. Unfortunately, this requires engineering two architectures: a student architecture, smaller than the teacher architecture, that is trained to emulate it. In this paper, we present a distillation strategy that produces a student architecture that is a simple transformation of the teacher architecture. Recent model distillation methods allow us to preserve most of the performance of the trained model after replacing convolutional blocks with a cheap alternative. In addition, distillation by attention transfer provides student network performance that is better than training the student architecture directly on data.