Dropout is another widely used regularization technique. In dropout, for each mini-batch, we randomly select a fraction p of the nodes, usually half, and temporarily remove them. We treat what remains as the network for that mini-batch: compute the error, compute the gradient, update the weights, then put back the removed nodes and their weights, and repeat with the next mini-batch. Do that for every single mini-batch, and at the end of the day, when you have converged, or partially converged, take the final network and scale all of the weights by the keep probability 1 - p, because during training each node received, on average, only that fraction of its inputs. With p equal to a half, each node now has twice as many incoming signals, so divide the weights by two. That's it. That's dropout. Quite clever.

Dropout, in theory, has an effect similar to sampling over exponentially many different networks. You're throwing in and out lots and lots of different network structures, so you're sampling over all of them. You don't keep a separate copy of each one, but in effect you do average them, and this ends up being like an ensemble of a bunch of networks, which of course has wonderful properties, because ensembles always work really well, and it tends to get the network out of local minima.

More precisely, dropout adds lots of noise. Every mini-batch is pushing in a different direction, and this is nice because it prevents overfitting. This is a really strong form of regularization. You're not going to get stuck in some local minimum, because you've just dropped half of the weights and tried something different. It also forces the neural net to spread the gradients across different weights and nodes, so it can't get too stuck. It gives a distributed representation in which all the nodes are being used: the signal can't be stored in only one node, because half the time that node is deleted.

It's also worth looking at standard neural nets. When they converge, it's often the case that lots of the nodes are dead. What do I mean by dead? No matter what inputs come into these nodes, the output is always constant. They're doing no processing. You could delete them with no effect; they're not being used. In fact, later on this week we will delete the ones that aren't having a big effect, but note that local minima typically involve lots of neurons that have effectively no function. With dropout, nodes can't stay dead, because they keep being bounced out, and the ones that were taking all the weight before now have to function and do something. As I said before, in a way which we're unfortunately not going to cover in this course, by trying these radically different networks for each mini-batch, we're sampling over an ensemble of many sub-networks and averaging them merely by keeping one weight vector. This is quite magic, and it gives nice, accurate, generalizable, regularized predictions.
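To make the procedure concrete, here is a minimal sketch in NumPy. The function name, shapes, and usage are my own illustration, not from the lecture. It applies dropout to a layer's activations rather than literally deleting nodes from the graph, which is the standard way to implement the same idea: sample a fresh binary mask for each mini-batch during training, and scale by the keep probability 1 - p at test time, which for p = 1/2 is exactly the "divide the weights by two" step above.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p, training):
    """Dropout on activations h with drop probability p (e.g. p = 0.5)."""
    if training:
        # Sample a fresh binary mask for this mini-batch: each node is
        # kept with probability 1 - p and temporarily removed otherwise.
        mask = (rng.random(h.shape) >= p).astype(h.dtype)
        return h * mask
    # At test time every node is present, so scale by the keep
    # probability 1 - p to match the expected training-time signal
    # (for p = 1/2 this is the "divide the weights by two" step).
    return h * (1.0 - p)

# Hypothetical usage: a mini-batch of 32 examples, 100 hidden units.
h = rng.standard_normal((32, 100))
h_train = dropout_forward(h, p=0.5, training=True)   # noisy, masked
h_test = dropout_forward(h, p=0.5, training=False)   # scaled, deterministic
```

As an aside, many implementations instead use "inverted dropout", dividing the mask by 1 - p during training so that no rescaling is needed at test time; the expected activations come out the same either way.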