It is increasingly common to train a large network and then compress it, prune it, shrink it, and use that to derive a smaller network, often one that's just as accurate. Sometimes this is done because you want to reduce the cost of running the network. Increasingly, people are trying to run deep learning networks on phones, for example for on-phone speech recognition, and for that it's important that the networks be small; you can't fit these massive networks on a phone.

There are many related techniques, which we'll talk less about, often called distillation or dimensionality reduction. They feel something like a projection, or a nonlinear principal component analysis. When I use the BERT language model, often instead of the full BERT, which is really big and has lots of parameters, I will use DistilBERT, where someone has distilled the large BERT neural net model down to a much smaller one that works almost as well but has far fewer parameters, and is useful for that reason.

The other thing, which we will talk about and you'll experience, is pruning: removing nodes or weights by zeroing them out. It turns out, as we will see, that like so much of regularization, it seems to be better to start with a large network and then compress it, prune it, distill it, make it small, rather than starting with a small network and training it.

So, how do we compress the network? The simplest way: train up your big network so that it fits well, then go through and find all of the weights with a small absolute value, below some threshold, maybe the lowest 20% or the lowest 80% of the weights by magnitude. Zero them out, hold them at zero, and then retrain the network with those weights fixed at zero. Often you'll do this iteratively, running 5 or 10 rounds, each time zeroing out the smallest weights. Amazingly, you can often zero out 90% of the weights with no important loss of predictive accuracy. You didn't need 90% of the weights to start with.

Well, could you have just trained a smaller network? The answer is sort of yes and sort of no. There's a cool recent paper, we're now up to 2019, viewing pruning as a lottery. What do they do? Start with a network, train the big network, and prune it so that you're still keeping most of the performance, only losing a little. Now you've got a bunch of zeroed weights. Go back to the original network you started with, but zero out the weights that you pruned after training. So you ran the training, you did the pruning, you got a new network, and what you learned is only which weights should be zero. Now go back to the initial network and train this pruned version, with those zeroed weights as the initialization, keeping them at zero throughout. It works just as well, sometimes even better. So we can use pruning as a search technique: a way to find a solution in this big space, one which has lots of its weights zeroed out.

Note that if you had known in advance which weights to zero, you wouldn't have had to start with the big network; you could have started with a network a tenth the size. But you didn't know in advance which weights should have been zeroed. So one way to look at this is that the large network is really a whole bunch of different subnetworks, each with a different set of weights zeroed out.
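To make the iterative magnitude-pruning procedure described above concrete, here is a minimal sketch in PyTorch. The small fully connected model, the pruning fractions, and the commented-out train_one_epoch() loop are illustrative assumptions, not anyone's actual recipe; the point is the loop of fit, zero out the smallest weights, and refit with those weights held at zero.

```python
import torch
import torch.nn as nn

def magnitude_prune(model, fraction):
    """Zero out the smallest-magnitude weights and return the binary masks."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:          # prune weight matrices only, skip biases
            continue
        flat = param.detach().abs().flatten()
        k = int(fraction * flat.numel())
        threshold = flat.kthvalue(k).values if k > 0 else flat.min() - 1
        mask = (param.detach().abs() > threshold).float()
        param.data *= mask           # hold the small weights at zero
        masks[name] = mask
    return masks

def apply_masks(model, masks):
    """Re-zero pruned weights (call after each optimizer step during retraining)."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.data *= masks[name]

# Iterative schedule: train, prune a little more, retrain, repeat.
# train_one_epoch(model, masks) is a hypothetical standard training loop
# that calls apply_masks(model, masks) after every optimizer step.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
masks = {}
for r in range(10):
    # train_one_epoch(model, masks)
    # prune 20% of the surviving weights each round; after 10 rounds
    # roughly 90% of the weights are zero.
    masks = magnitude_prune(model, fraction=1 - 0.8 ** (r + 1))
```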
Each of those subnetworks is a point in a large space, and one particular random initialization with its pattern of zeroed-out weights is the magic lottery ticket: it's the winner, it's the good network, and there are lots of other sets of weights you could have zeroed out that don't do nearly as good a job. So if you can figure out which weights to zero, for example by this iterative process, you can find this small network; but if you don't do that, how do you guess what the initial network should look like? I think this is something we see over and over in regularization: if we knew at the starting point which local optimum we should converge to, which weights should be small, which ones should be zeroed, we could have started with a smaller network and gotten there fast. But in fact what we've done over and over this week is start with a big network and then shrink it: gradually move down with stochastic gradient descent, zero pieces out with L1 regularization, do something that takes us to a network that actually generalizes well, and that's almost always one that's had lots of shrinkage, for example by zeroing out most of the weights.
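And here is a minimal sketch of the lottery-ticket experiment itself, again with an illustrative PyTorch model and hypothetical training functions: train the full network, find the pruning mask, rewind the surviving weights to their original random initialization, and retrain the sparse subnetwork with the pruned weights held at zero.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# 1. Remember the random initialization before any training.
initial_state = copy.deepcopy(model.state_dict())

# 2. Train the full network (hypothetical loop), then prune the smallest
#    80% of each weight matrix by absolute value, recording the masks.
# train_full_network(model)
masks = {}
with torch.no_grad():
    for name, w in model.named_parameters():
        if w.dim() < 2:              # skip biases
            continue
        k = int(0.8 * w.numel())
        threshold = w.abs().flatten().kthvalue(k).values
        masks[name] = (w.abs() > threshold).float()

# 3. Rewind to the original initialization, but keep only the surviving
#    weights -- this sparse subnetwork is the "winning ticket".
model.load_state_dict(initial_state)
with torch.no_grad():
    for name, w in model.named_parameters():
        if name in masks:
            w.mul_(masks[name])

# 4. Retrain the subnetwork from that initialization, re-applying the
#    masks after every optimizer step so the pruned weights stay zero.
# train_sparse_network(model, masks)
```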