Hello and welcome to week five of Deep Learning. This will be a week of regularization, one of the key tricks to make deep learning work well. We'll start by observing that deep neural nets typically have almost as many weights as they have training observations, so they really should overfit, but in fact, almost magically, they don't. We'll try to understand how and why they don't overfit that much, especially if you do the regularization correctly. The way to make things work well, to get low test error, is to train large networks with lots of kinds of regularization.

The key idea behind regularization, which I hope you've pretty much all seen, is the bias-variance trade-off. On the x-axis is some measure of the capacity of the neural net to represent information: how complicated it is, how much information it can store. You could think of that as the number of adjustable weights in the neural net, though if you shrink them with penalties, the effective number becomes smaller. If you start with a small network on the left, it will tend to underfit: it will have high bias and high training error. As the network gets more and more complex, moving to the right, the bias goes down and the training error goes down, but if you were to train the network on lots of different data sets, you would get more and more different outputs. It would have high variance across training sets, and therefore higher test error. Overall, what we want is to minimize the generalization error, the out-of-sample error, and that goes down at first as the model gets more complex and more powerful, and then goes back up.

So the game this week is partly to control model complexity, trying to get models that have the right capacity, not too much and not too little. The other piece, which is less obvious, is that neural nets have lots of optima, and we're going to look for techniques that take us to good optima rather than bad ones. What's a good optimum? One that generalizes well to new observations in the test set.

So we've seen that deep learning often uses more parameters than, or as many parameters as, it has observations. It should massively overfit. People like Zhang, Bengio, Hardt, Recht, and Vinyals have looked at various image data sets and shown that you can train down to zero training error, so you've fit all the noise, and you still get a small test error. You can take those same networks and randomize the labels (image in, label out, but with the labels shuffled), and they can still learn the randomized labels to zero training error; of course, out of sample they do no better than chance, because there's nothing to learn. So these networks are able to fit very complex functions and yet somehow not overfit horribly, and that's something we're going to try to understand this week.

Another example: GPT-3, one of my favorite natural language processing models, which we'll cover later in the semester, has 175 billion parameters trained on half a trillion words, so roughly the same number of parameters as words. It works remarkably well, it doesn't overfit horribly, but it does seem to memorize. How do we know it memorizes? Well, this network, as we'll see, is trained to take in a sequence of words and predict what words come next. So we can take text that it might have seen in the training set and see what it predicts. For example, take the question "What do you call a droid that takes the long way around?" Put that as input to GPT-3, and what does it output? "R2 detour." It probably didn't make that joke up; it probably memorized it.
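To make that random-label observation concrete, here is a minimal sketch of the experiment. It uses scikit-learn's small digits dataset and an MLPClassifier as stand-ins for the image benchmarks and networks in the actual study; the network size and iteration count are illustrative assumptions, not the paper's setup.

```python
# Sketch of the random-label experiment: the same network is trained once on the
# true labels and once on shuffled labels, and we compare train vs. test accuracy.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
y_random = rng.permutation(y)  # destroy any real input-label relationship

for name, labels in [("true labels", y), ("random labels", y_random)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(512, 512), alpha=0.0,
                        max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    print(f"{name}: train acc = {net.score(X_tr, y_tr):.2f}, "
          f"test acc = {net.score(X_te, y_te):.2f}")
```

The expected pattern is near-perfect training accuracy in both cases, but test accuracy collapsing toward chance (about 0.10 for ten classes) once the labels are random: the capacity to memorize is there, yet the same network generalizes when the labels carry signal.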
So these networks are memorizing data, but they're still not overfitting nearly as much as you might think. So what will we do this week? We'll talk a lot about regularization: L1 penalties and L2 penalties (sketched in code below), early stopping, which you should have seen, and data augmentation. Then we'll talk about stochastic gradient descent, which, almost magically, is itself a form of regularization that works quite well, and we'll quickly review dropout. We'll remind you of what you all know, which is that there are a lot of hyperparameters to tune as you decide which of these five, six, seven different regularizers to use and how much of each, and we'll say a little about what these things mean for generalization. We'll end by talking a bit about how you can compress and prune artificial neural networks to make them smaller, and then, the fun part, about adversarial attacks that alter images or text to mess up the classifications, and a bit about how to defend against them. So welcome to week five.
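As a concrete preview of the first two items on that list, here is a minimal sketch of how L1 and L2 penalties can be added to a training loss in PyTorch. The model, data, and penalty strengths are placeholder assumptions, not anything specific to this course.

```python
import torch
import torch.nn as nn

# Placeholder model and data, just to make the example self-contained.
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

l1_strength, l2_strength = 1e-5, 1e-4  # illustrative hyperparameter values

data_loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + l1_strength * l1_penalty + l2_strength * l2_penalty
loss.backward()

# In practice a plain L2 penalty ("weight decay") is usually handed to the
# optimizer directly, e.g.:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```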