The first regularization techniques we'll use are standard ones you should have seen before: simply adding in penalties. One can train a neural net to minimize a loss, say a sum-of-squared-errors (L2) loss or a max-ent loss, plus a penalty proportional to the L1 norm of the weights, that is, the sum of the absolute values of the weights. This is like a lasso regression, and it will zero out some of the weights. You can instead train a neural net to minimize the training error plus a constant times the squared L2 norm of the weights, the sum of the squares of the weights, like a ridge regression. This makes the weights smaller, and in particular makes the biggest weights smaller. Sometimes people do both of these. Sometimes they also add in something roughly akin to an L-infinity penalty: don't let any weight get bigger in absolute value than some number.

The intuition behind L1 and L2 regularization can be seen in this picture here. Looking first at the right, at the L2 norm: if we have some red curve trading off how big the two weights W1 and W2 are, the point on that curve with the smallest L2 norm is where it touches one of these circles, the level sets of equal W1, W2 size under the standard Euclidean (L2) distance. That's different from L1 regularization, where, with the same curve trading off weight one against weight two, we ask how to make the L1 norm the smallest. The L1 norm has diamond-shaped level sets with diagonal sides, and at this point the smallest L1 norm on the curve is reached with W2 at this value and W1 exactly zero. So you can see how the L1 norm can zero things out, whereas the L2 norm will try to shrink all of the weights.

You can also note that these norms are not scale invariant. As we saw last week, if you rescale the weights, you tend to get loss functions that are more spherically symmetric, and these norm penalties therefore tend to work better. Give it a try.
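To make the idea concrete, here is a minimal sketch of adding both penalties to a training loss, assuming PyTorch; the toy model, toy data, and the coefficients lambda_l1 and lambda_l2 are illustrative placeholders, not values from the lecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the lecture's code): a small model, toy data,
# and placeholder penalty strengths.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
mse = nn.MSELoss()

lambda_l1 = 1e-4   # strength of the L1 (lasso-like) penalty
lambda_l2 = 1e-4   # strength of the L2 (ridge-like) penalty

x = torch.randn(64, 10)   # toy inputs
y = torch.randn(64, 1)    # toy targets

for step in range(100):
    optimizer.zero_grad()
    pred = model(x)
    data_loss = mse(pred, y)  # the training error (sum/mean of squared errors)

    # L1 penalty: sum of absolute values of all weights.
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    # L2 penalty: sum of squares of all weights.
    l2_penalty = sum((p ** 2).sum() for p in model.parameters())

    loss = data_loss + lambda_l1 * l1_penalty + lambda_l2 * l2_penalty
    loss.backward()
    optimizer.step()
```

In practice the L2 term is often handled by the optimizer's `weight_decay` argument instead of being added by hand, and bias terms are frequently left out of the penalties; this sketch penalizes every parameter only to keep the example short.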