In the previous video, we saw that often, especially when you have many, many predictors, you want to find a balance between the accuracy and the simplicity of a model. So our goal, given a family of models M, is to have a way to measure the complexity or simplicity of a model in such a way that we can balance it against the residual sum of squares of that model. This is the time to talk about the Lp norms.

There are actually many ways to measure length in a vector space. You're most familiar with what's called the L2 norm, the Euclidean norm (or, for matrices, the Frobenius norm): take the components of your vector, square each one, add them up, and take the square root. It's familiar since grade school, and the level sets of the two-norm are what we think of as circles. But in fact, you can imagine doing this process for any exponent p: take the absolute value of each component, raise it to the power p, add them up, and take the pth root of that sum. As it happens, for p at least 1 that calculation respects the key properties of a distance function: it's zero if and only if the original vector is zero, it always gives a non-negative number, and it satisfies the triangle inequality. Those are easy things to check algebraically.

For our purposes, there are two other norms, aside from the two-norm, that are particularly interesting. The first is the one-norm. In fact, people usually encounter the one-norm by accident, because the first time someone tries the Euclidean norm, say in eighth or ninth grade, they often distribute the square root incorrectly and end up with the one-norm. The one-norm is simply the sum of the absolute values of the components. So if you have a vector like (3, -4, 5) in three-dimensional space, its one-norm is 3 plus 4 plus 5, which is 12. Now the level sets of the one-norm, the "circles," if you will, are actually diamonds, or rotated squares. For example, in this picture, if I take the first component of a two-dimensional vector to be 2 and the second to be 0, I can trade and go to (1.5, 0.5), or (1, 1), or (0.5, 1.5), or (0, 2), traveling along a diagonal line, and of course I can do that in any direction because of the absolute values. So these diamonds, these rotated squares, are the circles of the L1 norm.

Another very interesting norm is what's called the zero norm, which is, informally, the limit as p goes to zero of the p-norms. If you look at that carefully, it's simply counting the number of non-zero terms. So if you have a vector in, say, five-dimensional space, (3, 4, 0, 0, 5), its zero norm is 3 because there are three non-zero terms. As it happens, the L0 norm is a very good measurement of simplicity. After all, the more terms that are zero, the better: a small zero norm means you only have a few non-zero terms in your vector. That's great. On the other hand, it's very hard to compute effectively with the zero norm because it's such a discrete object; you can't use calculus to your advantage to solve anything.

So this is where we see how the L0 norm and the L1 norm interact, and it is because of the curse of dimensionality. Let's take a brief digression to the curse of dimensionality. The curse of dimensionality is a simple geometric and computational fact that demonstrates that human intuition is dead wrong when it comes to high-dimensional geometry.
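Before the digression, here is a minimal sketch in Python (not part of the lecture, and assuming NumPy is available) that computes the norms just described for the example vectors above; the choice of p = 1.5 in the general formula is purely illustrative:

```python
import numpy as np

v = np.array([3.0, -4.0, 5.0])
print(np.linalg.norm(v, 1))    # one-norm: |3| + |-4| + |5| = 12
print(np.linalg.norm(v, 2))    # two-norm: sqrt(9 + 16 + 25), about 7.07

# general p-norm: (sum of |v_i|**p) ** (1/p), here with an illustrative p = 1.5
p = 1.5
print((np.abs(v) ** p).sum() ** (1 / p))

w = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
print(np.count_nonzero(w))     # "zero norm": number of non-zero entries = 3
```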
Our intuition for how space works, how big space is and where things are, is based on millions of years of evolution in three-dimensional space, often looking at two-dimensional surfaces. So if we're actually dealing with five-, six-, seven-, or 10,000-dimensional data, that intuition is often dead wrong.

Here's an example of what I'm talking about. This is a picture of a one-by-one square with a bunch of points selected from it at random with a uniform distribution. Now take an inscribed square that sits 0.05 from the boundary on each side, so it goes from 0.05 to 0.95. The inside of that red square is 0.9 by 0.9, so the fraction of data inside the red square is 0.9 squared, or 81%, most of the data. In fact, I could draw the one-dimensional version of this: if you select points uniformly from a unit segment, you have a 90% chance of choosing a number between 0.05 and 0.95. So that's great; we think of the interior of the square as holding most of the data, and that's certainly true. In three dimensions it's slightly less true. If you take a one-by-one-by-one cube and inscribe a cube of side length 0.9, only 72.9% of the data is inside the smaller cube, because 0.9 cubed is 0.729. That's not so bad. Now consider 10,000-dimensional space. If you make a one-by-one-by-one, et cetera, by-one cube, and inside of it inscribe a 0.9 by 0.9 by 0.9, et cetera, cube, essentially none of the data is inside that smaller cube.

This is very striking. It says, and this is one of the classic metaphors for it, that if you take a potato in, say, 10,000-dimensional space and you peel it, there's nothing left. We're used to the idea that you can remove the boundary of an object and still have most of the object intact, that somehow the edge, the boundary, is just a little bit, and you can shave it off and still have the bulk of your material left over. That is profoundly not true once you get above, say, four or five dimensions. Of course 0.9 to the n, or anything less than one raised to the n, falls to a smaller and smaller number as n grows. That's a fact we all learned in algebra that's totally obvious algebraically, but it has this profound geometric consequence.

This has another side effect. Of course I can only draw the two-dimensional picture here, but if you think about the high-dimensional case, most points are near the boundary, and that means the nearest neighbors of a point can be very far away. In this two-dimensional picture, take a point: its nearest neighbors are nearby, in a small little ball, so the ball in which you're likely to see your nearest neighbors is pretty small. Another point's nearest neighbors are right there; another's are a little further out, but there's still a small ball in which its nearest neighbors lie. On the other hand, if you were to sample points in a similar picture in ten-thousand-dimensional space, a point selected at random is probably on a side, and that side is a 9,999-dimensional cube, so within it you're likely to be on its corner, which is a 9,998-dimensional cube, and so on. So the points are probably all in the corners. If you pick a point at random, it's probably on a corner, and that means its nearest neighbors are probably also on a corner, probably a different corner.
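To put numbers on the peeled-potato effect, here is a small Monte Carlo sketch in Python (again not from the lecture; the sample size and list of dimensions are illustrative choices) that estimates the fraction of uniform random points in the unit cube landing inside the inscribed cube running from 0.05 to 0.95 in each coordinate, and compares it with the exact value 0.9 to the d:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of sample points (illustrative choice)

for d in (1, 2, 3, 10, 100, 1000):
    pts = rng.random((n, d))                              # uniform points in [0, 1]^d
    inside = ((pts > 0.05) & (pts < 0.95)).all(axis=1)    # inside the inscribed 0.9-cube
    print(f"d={d:5d}  sampled fraction inside = {inside.mean():.4f}   exact 0.9^d = {0.9 ** d:.3g}")

# At d = 10,000 the exact fraction 0.9**10_000 is so small that it underflows
# double precision and prints as 0.0: essentially none of the data is interior.
print(0.9 ** 10_000)
```

Already by d = 100 the sampled fraction is indistinguishable from zero, which is the "nothing left after peeling" effect described above.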
So in fact, in high dimensions, even once you're above five or six, this starts to be a problem. If you select a point at random and ask where its nearest neighbors are and how far away they are, the answer is often: very far, because the nearest neighbor is probably on another corner, maybe an adjacent one, maybe a nearly antipodal one. Your intuition that points are somehow packed in and their neighbors are nearby is dead wrong in high-dimensional space.

These two effects are two facets of the so-called curse of dimensionality. As you work in higher and higher dimensions, so in this linear fitting case as you use more and more predictors, your intuition gets worse and worse. Moreover, sampling and finding neighbors and nearby points becomes harder and harder. This curse kills many algorithms that seem totally intuitive in one, two, three, or even four dimensions. But once you reach genuinely big data, with thousands or tens of thousands or even millions of predictors, it all breaks down.

So what does this have to do with the Lp norms? Let's flip back to them. The L0 norm is a good measurement of complexity, and remember, the L0 norm has all its concentration on the axes of your space. Any vector that has only an x1 component has L0 norm 1, and any vector that has only an x2 component has L0 norm 1. So in some sense the circle of the L0 norm is the axes themselves: a big X, all the axes crossing. One-dimensional lines, one for each dimension, crossing at the origin: that is the circle in the L0 sense.

However, if you're in high dimensions, consider the L1 circle, that rotated square whose corners sit on the axes. If you select a point at random from it, it's almost certainly near the boundary, and in high dimensions that boundary piece is itself high-dimensional. Again, if this picture is in 10,000-dimensional space, then one side of this square is 9,999-dimensional, and the corner of that side is 9,998-dimensional. So if you select a point at random from the inside of an L1 circle, you'll probably pick a point near the boundary; on that boundary, it will probably be near a corner; and in that corner, it's probably in an even lower-dimensional corner. So in fact, if you start doing a probabilistic process, or any sort of sampling or measurement process, using the L1 norm in high-dimensional space, it's almost as good as the L0 norm, because with high probability you end up on the corners, and the corners sit on the axes. Again, the curse of dimensionality pushes things toward the corners in high-dimensional spaces. So although these pictures look very different in two dimensions, in 10,000 dimensions they're almost identical.
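To see the nearest-neighbor half of the curse numerically, here is one more minimal sketch in Python (not from the lecture; the point count and dimensions are illustrative) that samples uniform points in the unit cube and reports the average distance from each point to its nearest neighbor as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of random points (illustrative choice)

for d in (2, 3, 10, 100, 1000):
    pts = rng.random((n, d))                                  # uniform points in [0, 1]^d
    sq = (pts ** 2).sum(axis=1)                               # squared length of each point
    dist2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T     # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                           # ignore each point's distance to itself
    nn = np.sqrt(np.maximum(dist2, 0.0).min(axis=1))          # nearest-neighbor distance per point
    print(f"d={d:5d}  mean nearest-neighbor distance = {nn.mean():.3f}")
```

The average nearest-neighbor distance climbs steadily with the dimension, which is exactly the "your neighbors are far away" effect described at the start of this passage.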