Distillation refers to a family of techniques that take a larger network and use it to construct a smaller network with similar properties. The basic idea of distillation is to first train a standard large neural net, the teacher, and then apply that teacher to a large set of unlabeled data, unlabeled data being easy to come by, thereby generating a large set of soft labels, that is, estimated or surrogate labels. Often along the way you can also look at the outputs of the hidden nodes, so you have not just what the neural net predicts for the final output but also what features it has used internally to reach that output. You then take either those soft labels alone, or the soft labels together with the learned features, and use them to train a new student network. And people have found over and over again that this is a good way to regularize, because the student has effectively been trained on a much larger data set, and that you can often train a smaller network that is at least as good as, and sometimes better than, the teacher at generalizing to future data.
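To make the pipeline concrete, here is a minimal sketch of the soft-label variant in PyTorch, assuming a classification setting: a fixed, already-trained teacher labels unlabeled inputs with temperature-softened probabilities, and a smaller student is trained to match them with a KL-divergence loss. The architectures, the temperature T, and the helper name distill_step are illustrative assumptions, not details from the transcript.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher/student architectures; the transcript fixes no sizes.
teacher = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

T = 4.0  # softmax temperature: softens the teacher's outputs into soft labels
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

def distill_step(unlabeled_batch):
    """One training step on unlabeled data using the teacher's soft labels."""
    with torch.no_grad():  # the teacher is fixed; no gradients flow through it
        soft_labels = F.softmax(teacher(unlabeled_batch) / T, dim=-1)
    student_log_probs = F.log_softmax(student(unlabeled_batch) / T, dim=-1)
    # KL divergence pulls the student's output distribution toward the teacher's;
    # the T*T factor keeps gradient magnitudes comparable across temperatures.
    loss = F.kl_div(student_log_probs, soft_labels, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage: iterate over batches of unlabeled inputs (random data stands in here).
for _ in range(100):
    batch = torch.randn(32, 784)
    distill_step(batch)
```

The feature-matching variant the transcript mentions would add a further term to the loss, for example an L2 penalty between intermediate hidden activations of the teacher and student, so the student imitates the teacher's internal features as well as its final predictions.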