That's pretty interesting, you know. If we are over-complete, if we have more data points than we have dimensions and the data spans all of those dimensions, linear regression is well-defined. We can solve it with matrix inversion, we can solve it with gradient descent, but the solution will be the same. But if we have less data, if we don't have enough data for that, the answer is underdetermined, and the matrix inversion route will actually throw us an error, depending on the implementation; in most implementations it will throw us an error. So in a way, whatever method we use, we can still make the mean squared error go small. It's just that all of a sudden there is no longer one solution; there are lots of solutions.

And there is a way of thinking a little more precisely about such problems. We are looking at ill-conditioned problems, and there is the notion of a condition number that quantifies how hard and how ill-defined a problem is. So let's look at an application of that. Let's say we want to solve Ax = b, which is just linear regression here. Now, the condition number is defined as a limit over a small epsilon: we look at all perturbations of the input within a ball of radius epsilon and take the largest ratio between the resulting change in the output f and the size of the perturbation. Picture a value in the middle and a small ring around it: within that ring, where is the response steepest? Of course this becomes well-defined as we make epsilon really small, but effectively we are asking: in which direction could we change the input so that f changes most? We are asking in which direction the problem is most fragile, where a small change has the most effect.

Now, our intuitions are really good in low dimensions and really bad in high dimensions, and if you play with this concept of condition numbers, you will often find that it surprises you. So when it comes to this linear equation, we understand it very well. What is the condition number? It's the largest singular value of the matrix A divided by the smallest singular value, κ(A) = σ_max(A) / σ_min(A). How can we understand that? Well, there are going to be some directions where the effect of moving in that direction is really big and other directions where it's really small, and in that sense this ratio is a measure of how difficult our problem is. And it's actually a measure that is applicable very, very widely: we are effectively asking how small changes in the inputs, changes that can happen just by noise, produce changes in the solution. In that sense it applies very broadly to solving inverse problems, and in our case, to solving deep learning problems.

And deep learning is really weird in lots of ways here. In many cases, we have more parameters than data. So what does that mean? If we build a big system solving computer vision problems, we might have millions of parameters; for some text problems, we might have billions of parameters. And at the same time, we usually don't have millions of data points. Even if we do, the data points tend to lie on relatively low-dimensional manifolds, which means that the condition number is really bad: it will be infinite, or at least very close to it. In lots of cases we are simply over-parameterized, so the solution we end up with depends on how we solve the problem. The dynamics of learning all of a sudden matters. And realizing this, that in a way we have this degree of freedom, that there is not just one solution waiting for us, is the key to understanding deep learning's successes.
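To make that concrete, here is a minimal numpy sketch of the two regimes; the matrix shapes and random data are my own choices, not from the lecture. It checks that in the over-determined case the normal-equations route and a library least-squares solver agree, computes the condition number as the ratio of singular values, and shows that in the under-determined case many different parameter vectors fit the data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-determined: more data points (rows) than dimensions (columns).
A_tall = rng.normal(size=(100, 5))
b_tall = rng.normal(size=100)
x_normal_eq = np.linalg.solve(A_tall.T @ A_tall, A_tall.T @ b_tall)  # normal equations ("matrix inversion" route)
x_lstsq, *_ = np.linalg.lstsq(A_tall, b_tall, rcond=None)            # library least-squares solver
print(np.allclose(x_normal_eq, x_lstsq))                             # True: one well-defined solution

# Condition number of solving A x = b: largest / smallest singular value of A.
s = np.linalg.svd(A_tall, compute_uv=False)
print(s.max() / s.min(), np.linalg.cond(A_tall))                     # the two agree

# Under-determined: fewer data points than dimensions.
A_wide = rng.normal(size=(5, 100))
b_wide = rng.normal(size=5)
# A_wide.T @ A_wide is singular, so the normal-equations route breaks down,
# yet many different x reach zero squared error:
x_min_norm = np.linalg.pinv(A_wide) @ b_wide                         # one particular (minimum-norm) solution
e0 = np.zeros(100); e0[0] = 1.0
null_dir = e0 - np.linalg.pinv(A_wide) @ (A_wide @ e0)               # a direction in the null space of A_wide
x_other = x_min_norm + 3.0 * null_dir                                # a different solution with the same zero error
print(np.allclose(A_wide @ x_min_norm, b_wide),
      np.allclose(A_wide @ x_other, b_wide))                         # True True
```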
This degree of freedom is also the key to regularization, about which we will learn a lot during week five. So this is why we should really worry about the dynamics of linear deep learning. The algorithm that we use for optimization matters, and its behavior may be non-trivial. So how does that affect convergence? How do learning curves look in these cases? Why don't you give it a little bit of a try?
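As one way to give it a try, here is a small gradient-descent sketch on an under-determined least-squares problem; the sizes, learning rate, and random data are again assumptions of mine, not from the lecture. The loss can be driven essentially to zero from any starting point, but which zero-error solution we converge to depends on the initialization: starting from zero, plain gradient descent lands on the minimum-norm solution, while a random initialization lands somewhere else.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 50))          # 5 data points, 50 parameters: over-parameterized
b = rng.normal(size=5)

def run_gd(x0, lr=0.01, steps=5000):
    """Plain gradient descent on the squared error 0.5 * ||A x - b||^2."""
    x = x0.copy()
    losses = []
    for _ in range(steps):
        grad = A.T @ (A @ x - b)
        x -= lr * grad
        losses.append(0.5 * np.sum((A @ x - b) ** 2))
    return x, losses

x_from_zero, loss_zero = run_gd(np.zeros(50))
x_from_rand, loss_rand = run_gd(rng.normal(size=50))

x_min_norm = np.linalg.pinv(A) @ b    # the minimum-norm interpolating solution

print(loss_zero[-1], loss_rand[-1])                       # both ~0: the error can always be driven down
print(np.allclose(x_from_zero, x_min_norm, atol=1e-4))    # True: zero init converges to the min-norm solution
print(np.linalg.norm(x_from_zero), np.linalg.norm(x_from_rand))  # the random init finds a different, larger solution
```

The design choice to compare the two initializations is what exposes the point of the lecture: the loss curves look similar, but the dynamics of gradient descent, not the loss alone, decide which of the many solutions we actually get.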