So now it makes sense to say that the data we get today accumulate evidence for what might happen tomorrow. That's ingredient one. Ingredient two is that we don't pose the problem as one of recovering a probability distribution; we just want to be able to make decisions, to make classifications, to make some kind of prediction about the data, which is why all we care about is finding a function. Given an input, it's not about predicting the right output. There is no right output, because the relationship is probabilistic, so we have to accept the possibility of making mistakes. And that's why we introduce the loss function.

Now we can say that the goal is to find the function that makes as few errors as possible on the data, and not only on the data we have today, which is all you can do in practice; what you really care about is the kind of prediction you can make on future data. That's why you introduce this quantity. This is called the expected risk. It is the average error on all possible data. And as we noticed yesterday, it is weighted with respect to the probability of getting a point: the expectation is a weighted average. It says that if a point is more likely to be sampled, you should pay more for mistakes on it. As I said, I like this slide because it's short but contains a lot of information. Conceptually, it contains a lot of ingredients, and it's good to come back to it and stare at it a bit.

The last point is that, while this is clearly the ideal objective, it is also an impractical objective, because evaluating it is typically hard or impossible: it might be that you would have to generate an enormous amount of data, or that you simply cannot do it at all. With simulated data you might be able to do it, but it can be very costly. In most learning problems, you just cannot explore that much data. You can take 10, 100, a million points, but you cannot take infinitely many. So what you get at your disposal is a set of points sampled according to the distribution. And again, since they all come from the same distribution, you can somewhat hope that as you get more data, you accumulate evidence about what's really generating them.

So this is the game. It's a game that's lost from the beginning, in the sense that you will not be able to find the perfect solution; you will have to live with solving a problem from approximate information. That's what we call learning. From this partial information, we would like to make some statement about something ideal. So a learning algorithm is something that, given what you have, tries to find a solution to this. And learning theory is all about establishing the rules under which an algorithm solves this in a quantifiable way: how many samples you need to solve it with a given prescribed accuracy. We're not going to discuss this at all, but you've seen bounds, and bounds are about exactly that: quantitative statements about how well you can solve this problem given just this data. What I'm trying to do with these lectures is more like a guided tour of how you build algorithms. And the first one we saw is arguably the most frequent and most popular one, which is what I call empirical risk minimization. You can call it regularization, you can call it M-estimators, you can call it many possible things. But the general idea is: I replace the space of functions, and the notation is a bit sloppy here.
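For the record, the quantity on the slide is presumably the usual expected risk, together with the empirical version we actually compute in practice; in symbols (my rendering, since the board itself is not in the transcript):

L(f) = E_{(x,y)~ρ} [ ℓ(f(x), y) ] = ∫ ℓ(f(x), y) dρ(x, y),    L̂(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i).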
It just means all possible functions for which the integral makes sense. But this is not something you can handle numerically. So you replace this set of functions with something you can handle, and you replace this objective with something you can handle. The objective function is replaced with the empirical sum, which you can now compute. And the set of all possible functions is replaced with linear functions. This penalty you can view as further constraining the functions to live, essentially, in a ball. If you take the point of view of Lagrange multipliers, you are putting a norm constraint on the possible linear functions you consider. So, to be on the same page, you could also consider minimizing this under this constraint; the penalized problem is the Lagrangian formulation of this constrained one. Viewed like this, you see that not only am I considering linear functions, I am also constraining them within a ball. So it's a subset.

As we said yesterday, you could object: linear functions are simple enough. But we saw that if you have a lot of dimensions and not so many points, linear functions start to be already pretty rich. And not only that, because what you really did, a bit quickly since the story can be made pretty short, is that you can view this as shorthand for a more complicated situation where what you really have is a dictionary. You replace this xi with some new vector of features that are nonlinear, and now j can go from 1 to p, where p you choose. Examples are monomials, exponentials, cosines, sines, literally whatever. And at this point, if you let me put something like a tilde here, that's another notation we could use. We'd just say: this guy is not really the original input; it's not the original pixels of an image, but this, this, and this description of an image. As I said, I keep this notation because it just makes life easier, and it doesn't really matter. But keep in mind that sometimes today we're going to draw nonlinear functions out of this model, essentially thinking that I really replaced this with something else.

For the discussion we start today, it's important to remember that the dimension of this x can be really huge. Because if you think of it as shorthand for a nonlinear model, then p can be arbitrarily large. You basically shove all your ignorance about the model under the rug and just try to make p as large as possible. And what we saw yesterday is that we can use a little trick, which in fact is pretty deep, to let p go even to infinity, in which case you are really working with a potentially extremely large model. We did this a bit quickly yesterday. This takes us into what we call non-parametric statistics, or non-parametric learning. As far as I know, it doesn't have a very precise definition, but it has a working definition which is kind of useful. Essentially, we talk about parametric models whenever you commit a priori to finitely many parameters, which you choose before seeing the data. Or even after seeing the data, but once and for all. Say, I choose to take a polynomial of degree 3; if you check, this is the same as going into a number of dimensions that depends on the dimension of the input and the degree of the polynomial. Or you say: I take frequencies from frequency 1 to frequency 100, and that's it.
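To make the objects concrete, here is a minimal numpy sketch of the penalized empirical risk minimization just described, for the square loss and squared norm; the name `ridge_erm` and the toy interface are mine, not the lecture's.

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Penalized empirical risk minimization with the square loss:
    minimize (1/n)||Xw - y||^2 + lam*||w||^2 over linear functions w.
    Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# usage: w = ridge_erm(X, y, lam=0.1); predictions are X_new @ w
```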
So you fix them a priori. What is the problem with that? You never know how many are important. You can get a data set today for which 100 are important, and then tomorrow I give you more data from the same source, and you have to increase the number by hand. So what do you call non-parametric? You call non-parametric a model where, a priori, you choose to live in an infinite dimensional space, and then you set up a procedure that chooses the effective dimension needed for the data set in some smart way.

If you remember, yesterday the key trick was that this expression can be further rewritten: w can be written as a linear combination of the input points, given some weights called the ci, which means that f(x) can now be written as this stuff. So this is the normal formulation, and this is the rewritten one. That was the story. And we spent a bit of time convincing ourselves that as long as this series has a finite value, I'm actually able to make these computations. That, in a nutshell, was the end of the last class.

Here I'm pointing out something that might still look parametric. What comes out of this story is that I choose an infinite dimensional model, but, thank god, in practice I only have to find finitely many coefficients, the ci's. So what does this terminology mean? I said: if you choose finitely many parameters, that's parametric; if you choose infinitely many, that's non-parametric. But here there are finitely many coefficients. So why would you call this non-parametric? Well, because here I choose an infinite dimensional model, but the minimizer of my problem can always be written in a parametric way, and the number of parameters is not something I choose: it's the number of points. If you give me 100 points, I get 100 parameters. If you give me 200 points, I immediately get 200 parameters, and so on. So when the number of points goes to infinity, your model can naturally explore an infinite dimensional space. In some sense, it's this adaptive aspect: you choose a priori an infinite dimensional model and then grow inside it. Of course, you could also say: can I just take p as a function of n, chosen somehow? You could do that. But this does it automatically. And this is why we call it non-parametric.

And there are a few examples of these: kernel methods, which is this stuff, are one; Gaussian processes, which are basically the same model revisited from a Bayesian perspective, are another. And there are really only a few other non-parametric models: local models like nearest neighbors, and things like Dirichlet processes in Bayesian statistics. But it's a handful. Most of what we do is parametric. Anyway, this was basically the end of yesterday's class. We stick to this guy, and then we did a quick extension into the nonlinear world by looking at nonlinear features and kernels. Today we're going to stick to the linear model; just remember that sometimes you can think of it as short notation for something more complicated.

Yes? So, he's asking whether non-parametric means that you have no parameters. No, no. You do have parameters. For example, here we will still have lambda.
And even if you take a Gaussian kernel, you still have the width of the kernel. So it doesn't mean that it's parameter free. It means that the model, in this case the function class you consider, is infinite dimensional. So if you think of the function as the main object you're trying to find, that object is infinite dimensional. Now, in this kind of approach, you make a distinction between what we call the parameters and the hyperparameters. It's a fine distinction, because they are all parameters. But in this approach you make a substantial effort to make the problem of finding the parameters convex, nice, and with guarantees. And then you live with the fact that for these hyperparameters, which you try to keep down to one, two, three, you're going to do some cross-validation: you split the data in two, fit on one half, and use the other half to pick them. That's typically what you do.

And this is in contrast with what you do at other times. People like me are very concerned about this: I try to keep the number of free parameters as small as possible and to find guarantees. And this limits you sometimes, because you could say: what if I put one lambda for each dimension and just optimize? It makes sense, but it's very hard to give guarantees on that, because now you have a big non-convex optimization problem. In Bayesian approaches, for example, this is quite common: you just add more parameters, you put a prior on them, and you marginalize them out or do whatever you want. And there's nothing wrong with that; it's just that the amount of guarantees you can provide, in terms of, say, how far you are from the optimal solution, is much smaller, because you're getting into more complicated problems. So here we're taking a rather conservative approach. So, back to your question: the parameters that grow to infinity are these; we will still have to fix tuning parameters that govern the size of the space. OK, cool.

So this is more or less a summary. Hopefully, at this point, you have the feeling that this assumption is not huge. As I said, the only real difference comes if you consider neural-network-like things, because there the math is very different; if you stick to this kind of model, the math is just the same. Now, one of the nice things about this is that it can be easily generalized to different loss functions and different norms. At least conceptually, the algorithm is the same. You put a different loss function here, say cross-entropy, logistic, hinge loss, whichever you like; from a conceptual point of view, you don't have to add anything. It's the same story. And likewise here: instead of this norm, you put another norm. You put some other prior, in whatever sense you want to mean prior: some bias towards certain solutions. It still works fine. What's going to be different? Mainly the computations. If you take the square loss and the squared norm, when you take the derivative the squares go away and you get a linear system. If you take another loss function, you don't. You typically get a nonlinear equation, and you typically have to solve it by gradient descent, or some iterative solver, or an optimization solver of some sort.
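Since both lambda and the kernel width just came up: here is a minimal sketch of the kind of model under discussion, Gaussian-kernel regularized least squares with its two hyperparameters picked on a held-out half of the data. The names and the split-in-two recipe are my own rendering of what's described above.

```python
import numpy as np

def gaussian_kernel(A, B, width):
    # k(x, x') = exp(-||x - x'||^2 / (2 width^2))
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * width**2))

def fit(X, y, lam, width):
    # Representer theorem: f(x) = sum_i c_i k(x_i, x); one coefficient
    # per training point, although the model is infinite dimensional.
    n = len(y)
    return np.linalg.solve(gaussian_kernel(X, X, width) + n * lam * np.eye(n), y)

def predict(X_tr, c, X_new, width):
    return gaussian_kernel(X_new, X_tr, width) @ c

def holdout_pick(X, y, lams, widths, seed=0):
    # Split the data in two: fit on one half, score on the other, and
    # keep the (lambda, width) pair with the smallest held-out error.
    idx = np.random.default_rng(seed).permutation(len(y))
    tr, va = idx[: len(y) // 2], idx[len(y) // 2:]
    best, best_err = None, np.inf
    for lam in lams:
        for width in widths:
            c = fit(X[tr], y[tr], lam, width)
            err = np.mean((predict(X[tr], c, X[va], width) - y[va]) ** 2)
            if err < best_err:
                best, best_err = (lam, width), err
    return best
```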
So from a certain perspective, you can generalize this quite a bit, to different loss functions, different norms, different function classes, and you get the same principle. You just have to specialize the computations. And implicit in what I just said is that you somewhat split your brain in two parts: the modeling part and the computational part. If you start from this principle, you say: from now on I don't worry about how to compute things, I just try to find the good principle. And the principle is: minimize the empirical objective over a manageable function class. That's all. There is nothing computational in this, nothing numerical. Then, in a second step, you say: OK, fine, I did that, it makes sense, I can even study how far this is from the original problem. Now I close that part of my brain, the statistician side, and I open the optimization side: how do I solve this problem?

If you walked in the door right now and I asked you, is this an algorithm or is this a problem, what would you say? It depends on who you are. It's like a test of whether you're more of a statistician or more of a numerics person. If you look at this and say, it's clearly an algorithm, you're more of a statistician. If you say, that's a hell of a problem, you're more of a numerics person. And this shows that in this approach there is a split between the numerical aspects and the modeling aspects. Nothing wrong with that; we always like to take a hard problem and split it into two parts we can manage. But to some extent, this also means that all the numerical aspects of the problem are treated separately from the statistical aspects. What is the best accuracy I can achieve? That's one part. How many computations do I need to achieve it? That's answered separately. Right now in machine learning these two sides are merging a bit, but typically you have an optimization conference where people stand up and say, in machine learning, this is what you care about; and then you go to a statistics conference where they say, in statistical learning and machine learning, this is what you care about. Well, I care about both. I would like to know how all the errors I make interact with each other.

And what we're going to see today somewhat questions this whole story and says: maybe there is another way of doing things, one that ties the computational and statistical aspects together a bit more deeply. We'll try to see that this is one way to go, but it's not the only way. If you come from a Bayesian perspective, you know that this can be seen as a maximum a posteriori estimate with a Gaussian likelihood and a Gaussian prior, and if you change the likelihood and the prior, you get all the other cases. What I'm going to present today, as far as I know, does not have a natural Bayesian interpretation. And that's partially why it's considered more of a trick than a principle; it's considered more of a hack. You'll see in a minute what it is. OK, so that's where we are today. We want to take this, throw it away, and ask: what's left? If you give me this problem, how do I solve it if I'm prevented from using this, just for the sake of finding something else? How do we do it? OK, any questions about this?
This is more or less a summary of what we said so far. Any questions about this? Cool? So, a quick word about materials. Most of these classes are online; there are videos. The last class was actually three classes squeezed into one; that's why you felt the pain at some point, because you were at class three after 40 minutes. But you can watch them online if you want. So there are videos, and there are the slides of each class. I can give you the version where I squeezed them together, but I already gave it to Erica, so they should be online either already or soon. If they aren't, let me know. I also have a set of messed-up notes that I keep for the class I teach. They're messed up in the sense that I started rewriting them and then stopped, but I can give them to you at your own risk. If something doesn't make sense, it probably doesn't, so don't read into it too much. So remind me if you don't find any materials; I can give you a bunch.

OK, so back to our story. Let's take a step back and see how we got to the empirical risk minimization algorithm. One way was: you start from this least squares problem. You think about least squares and you end up looking at the pseudo-inverse. You remember what the pseudo-inverse was? You invert, but you throw away the eigenvalues that are exactly zero. And then you realize that by doing this, when there is more than one solution, as in the overparametrized case, you're selecting the minimal norm solution. And then we can argue about whether we want to call this regularization or not. As I said, in signal processing you would call it regularization. And even if you look at the classical SVM, for those of you who know about margins, this is the margin in a binary classification problem, and you can think of this as finding, among all solutions, the one with maximum margin. So you could stop here and say, this is already going in the right direction.

But we already saw yesterday that the problem is that, hidden in here, there are large and small eigenvalues, and potential instability once you perturb the data. So the condition number, which is a problem in numerics, becomes a problem in statistical learning, because we are solving this, but what we would really like to solve is the problem where we have more data. Or, in other words, we're given this data set, but it's a random set, and tomorrow we might be given a different one. We would like our solution to be stable with respect to the random selection of points, at least to some degree that is certifiable. So we don't just need a solution that is good for this data set. We need a solution that takes what's good in this data set, or any other data set coming from the same source, and somewhat tosses away what we cannot rely on. That's why we introduced this: basically, it puts a threshold on the size of the eigenvalues we can or cannot trust.

All right, now we want to take a different path. And the path starts from the observation that, if you go back to the numerics books, pretty much nobody would do this unless the problem is really small, and likewise for this. These are what are called direct solvers: you take a problem and you just compute the solution straight up. The problem is that you typically need quite a bit of memory, you need to be able to manipulate all of this, and the cost of the operation is typically cubic; in this case, p cubed. What do people do instead? Iterative solvers.
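Before moving on, the pseudo-inverse story is easy to poke at numerically. A small sketch in the overparametrized regime (numpy, my own toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # overparametrized: n < d
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# pinv inverts only the nonzero singular values (the zero ones are
# thrown away), which selects the minimal norm interpolating solution.
w = np.linalg.pinv(X) @ y
print(np.linalg.norm(X @ w - y))     # ~0: fits the data exactly
print(np.linalg.norm(w))             # smallest norm among all solutions
```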
What are iterative solvers? Gradient descent. So let me write down gradient descent for least squares. You take this guy, you take its gradient, which gives you this expression here, and then you update the solution by moving against the gradient. Everybody OK with gradient descent, more or less? You can think of it as a simple version of Newton's method, where you use the first derivative to decide the direction, but only a rough estimate of the second derivative, so you don't have to do extra computations. And you go down the direction of the gradient. Fair enough?

So we know the classical theory. The first observation is that there has long been a theory that tells you how to choose the step size to ensure that you get there, that you really are minimizing this. For example, in our setting, if you call this L hat, because it depends on the data, it tells you that as the number of iterations increases, the error converges to the best possible empirical error, because that's what you're shooting for: the empirical objective. So if you put the number of iterations here, and this quantity here, this is going to go down and flatten out at the value of the minimum, which may or may not be 0. In the overparametrized case, it is typically exactly 0; otherwise, it's whatever it is.

So notice that here life is good. A couple of remarks. One: everything is good because this is a convex problem. It's typically not strongly convex, but it's convex; it looks like a smile with a slightly flat region here. It means you get there. Two: it's not stochastic in any way. The data are stochastic, but the algorithm is not. And it's a descent method, so it goes down without ever going up: along the trajectory, you always decrease the objective. Other methods we'll comment on, like stochastic gradient, can go up and down a bit; globally they decrease, but locally they can oscillate. OK, fair enough. And notice that nothing would prevent us from using the same approach to solve this problem instead. Instead of that direct solver, I could consider a gradient descent solver. The only difference is that rather than taking only the gradient of this part, I also have to take the gradient of this part. That's it: here you would just add a 2 lambda w term or something; there's just an extra term. You could do that, no problem.

OK, now we want to understand a little what this iteration is doing. And the spoiler is that this is going to be a regularization all by itself. No constraints, no penalties, no nothing: this guy is doing the same thing that this guy is doing. So we toss this in the bin, and all we use is going to be this. The main thing we're going to do today is convince ourselves that this is enough to do everything the other guy was doing. It's actually a classical fact, but it's not as well known as the other one, let's say. OK, so let's start. Today's class is considerably shorter and less packed than yesterday's. The first observation is the following. Look at this expression. Let's assume that you start from 0: you start from t equals 0, and the first vector is 0.
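Written out in code, the iteration on the board looks something like this (a sketch; the constant step size below is the standard safe choice, of order one over the largest eigenvalue of the relevant matrix):

```python
import numpy as np

def gd_least_squares(X, y, t_max, gamma=None):
    """Gradient descent on L_hat(w) = (1/n)||Xw - y||^2, from w_0 = 0.
    The gradient is (2/n) X^T (Xw - y)."""
    n, p = X.shape
    if gamma is None:
        gamma = n / (2 * np.linalg.norm(X, 2) ** 2)  # safe constant step
    w = np.zeros(p)
    for _ in range(t_max):
        w -= gamma * (2.0 / n) * X.T @ (X @ w - y)
    return w
```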
It means that the next vector is given by this. Now, how big is this vector? Well, if this matrix is n by d, then this vector here is of size n, and X transpose applied to a vector of size n is something we already encountered: it takes the rows of your matrix, and the resulting vector is a linear combination of the rows. So this is the matrix, n by d. Now take the rows, and let's assume for a minute that n is smaller than d. If you take linear combinations of the rows, are you able to span every vector in the space? No, right? Because n is smaller than d. So you will basically only be able to span part of the space: if this is R^d, you can span some of it, and then there is a part left over.

I'm going to say something really simple now, so don't expect anything intelligent for three minutes. You can take, say, the span of the rows, this stuff; it's still written here, if you want. This is the feature map version, this is the plain version. And then you have the rest. What's the rest? It's the null space of the matrix: whatever the matrix sees and sends to 0. So we are just doing the classical orthogonal decomposition of a space. You give me a matrix, I take the span of the rows and whatever the matrix sends to 0, and I split into the range and the null space. You agree?

OK, but if you remember, yesterday we said: if you give me this linear system and I find a solution w hat, then in general you have a non-uniqueness problem. Why? Because you can add anything here that is sent to 0, and this is still a solution. So what I'm trying to get at is that, in principle, this could be any vector in R^d. But now you realize that if you start from 0, then this vector is always in the span. In other words, looking at w_t, with this being the span and this the null space: if you start from 0, you start here, and as you update, you stay here at every iteration. Because you take a vector which is a linear combination of the rows, then the next time you take another linear combination and sum them up; so the iterates are all of this form. You stay there. You start here, and you keep staying here. Agree?

Now let it go, let it converge. What will it converge to? Which of these infinitely many solutions? Well, these infinitely many solutions all look the same: one guy in the span of the rows plus a guy here. So the question for you is: among all these solutions, which one will gradient descent converge to? The only one which lies here. Agree? I'm going a bit slowly, not because this is difficult in any way, but to let it sink in that gradient descent doesn't go wherever it wants. If you initialize at 0, or within the span, you stay in the span. Wait a minute: it means that gradient descent is not going wherever it wants. It goes to this. It doesn't just solve this problem; it actually solves this problem.
Does that make sense? I don't want to draw conclusions yet, but is the fact clear? The fact rests on two lines of math: you give me a matrix, you split the space into the span of the rows and the null space, that's it. And then you realize that this guy always lives in the span of the rows. Is that OK, more or less?

OK, now let's see what we can say. Earlier we had a few minutes of discussion about whether or not to call this regularization. Well, now we can have the same discussion for this guy, because this guy is going to the same place. As gradient descent moves around this space, what it actually does is try to satisfy this constraint while increasing the norm of the w's as little as possible. It tries to find a solution that fits this, but doesn't add any component of this kind. If you watch what the norm does, it increases as little as possible: the iterate tries to fit the data, but the norm doesn't grow beyond a certain point. That's what the dynamics of gradient descent is doing. So gradient descent implicitly has a bias, and bias here again means a preference, in the way it explores the space of all possible solutions. And it really comes from this observation: X transpose puts you back into the span. It can never be here; it's always there.

Notice that here it doesn't feel like you're imposing any specific constraint. When you do this, it's clear that you want the minimum norm, because you say it: give me the minimum norm. You put the constraint; among all possible solutions, you define the selection criterion explicitly. Here the same thing is happening, but it doesn't feel like you're doing it. The first time you see it, it's like: how does this happen? How does it know where I want to go? What if I change the norm, how would I get it to go to the right place? We don't know yet, but at least we know that this is going somewhere specific. That's why people call this implicit regularization: it doesn't come from saying, I put myself in a ball, I impose some explicit constraint. The gradient descent dynamics naturally goes somewhere.

To be more precise, this assumes that you either initialize at 0, or at anything in the span of the points; for example, you initialize with one of the rows. If you initialize with just anything, what happens? Well, essentially, the component outside the span will never be corrected. You split the space in two parts: you keep correcting whatever is here, and you never touch this part. So this will still converge, but not to the minimum norm solution. That's exactly what this is showing. Again, the minimal norm solution is the same as selecting only this guy and nothing in the null space. And the observation is that the iterate lives exactly in the span of the rows. So if it does converge, it can only converge to that one solution: the minimal norm one.
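This claim is easy to check numerically. A small sketch (my own toy data; the step size is the standard safe choice, and we initialize at zero, i.e. in the span):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                           # overparametrized again
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

w = np.zeros(d)                          # start at zero, i.e. in the span
gamma = n / (2 * np.linalg.norm(X, 2) ** 2)
for _ in range(5000):
    w -= gamma * (2.0 / n) * X.T @ (X @ w - y)

w_min = np.linalg.pinv(X) @ y            # the minimal norm interpolator
print(np.linalg.norm(w - w_min))         # ~0: gradient descent found it
```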
So again, the reasoning: the summary is written here, and here we're just trying to give a more geometric interpretation of it. Let's see if this is better than what I had before. This vector, because of its definition, is of this form: that's observation one. Observation two: the minimal norm solution is the only one which is orthogonal to that. What happened is that we can split R^d into the range of X transpose and the null space. This solution always lives here, because that's its definition: among all the solutions of the linear system, it's the only one that lives entirely in this guy. That's the first fact. The second fact is that gradient descent converges, and it converges in here. And there's only one solution in here, so it has to be the minimal norm solution. So there isn't much more intuition than the fact that the iterates live in the span of the rows.

The minimal norm solution is one among all the minima. Again, any vector that satisfies this is a minimum, and you cannot get stuck: there are no bad local minima; every local minimum is a global minimum, because it's a convex problem. We are looking at objective functions that look like this at best; they don't look like that, because it's least squares on linear models, so it's convex. It doesn't have to be strictly convex: it can be flat, and that's the source of non-uniqueness. When you take this problem, if this matrix is invertible, there is only one solution. But in the overparametrized case, there are many possible solutions, which means that at the minimum you can have many points. When we did the minimal norm solution, we said: among all these points, pick one particular one, the one with minimal size. What I'm telling you here is that gradient descent is not going to converge to just any of these points; it's going to converge to that very guy. In one dimension it's trivial, if you want. But again, the point is just that if you start here, you go there. That's all it says.

Anyway, the first observation is that by the mere fact that this vector lies in the span of the rows, we know that if gradient descent converges, and with an appropriate choice of the step size it will, it will not merely converge to a minimizer; it will converge to the minimal norm minimizer. So, in some sense, it implicitly explores the space in a specific way. And then we can agree or not on whether to call this regularization. If you think in terms of max margin, it's going to go to the maximal margin solution. This means that if you just let an algorithm minimize the error, it's not going to do something crazy. It's still going to find the simplest possible solution that fits the data; it's not going to let the norm grow indiscriminately. So you cannot expect crazy instability from the outset, because it's still doing something. That's observation one. But we are at the same stage we were at when we introduced this. Yesterday we got here and said: OK, we can call this regularization, it's the minimal norm, we throw away the zero eigenvalues. But then what did we do? We started to complain, let me remind you.
We started to complain about the fact that this can still be unstable: if the smallest eigenvalue of this is small, I can get unstable behavior. And with respect to that, it looks like we've made no progress, because at this point we only know that this is the same as the minimal norm solution. So if the minimal norm was a good idea, we're good; but if it was a bad idea, we still have to deal with instability. Make sense? So now we ask: can this guy handle instability in a nice way? And I already told you the answer is yes. Why? Because this guy produces not only a good final solution; the solutions it finds along the way proceed in a way that fixes the instability in a precise sense. That's what we want to see.

Before we do any math, let me show you a simple simulation. We're going to do this for nonlinear models; assume we have a nonlinear function, otherwise my drawings would be too trivial. So imagine I give you this data set. This is x, this is y, this is a bunch of points. You start from the zero solution, and then you let gradient descent go, without constraints. What is it going to do? Well, it's going to start to fit the data, fit them, fit them more, and then it's going to interpolate the data: go exactly through them. If that's a good idea, there's nothing else to be done; if you have enough points and there isn't much noise, it might be enough. But if you have a lot of noise and the points are not so many, maybe you have to cure that. So, for example, in this specific case, where there is not a lot of noise and there are a bunch of points, among all these solutions, which one looks good to you? Probably this one.

Wait a minute: how do you get this solution? If I give you the algorithm and the data set, and you can look at the picture, how do you get it? You don't just let the algorithm converge. What do you do? You don't go all the way; you stop. You stop here. From a hacking point of view, this is pretty obvious: you start to fit, and when you start to see something that seems a bit unstable, you stop. This trick is pretty old. It's called early stopping, and it's a classical trick used to train neural networks: you fumble around with the step size a bit, you let it go, and if you start to see behavior that doesn't look good, you stop.

How can you monitor this if you have a high dimensional data set? Well, you could look at the test error. If you now plot not only the training error but also the test error, let me call it L(w_t); this is the true expectation, which we cannot compute, so in practice you would take a hold-out set, but assume for a minute that you can look at it. How is it going to look, according to this picture? Can you guess? Look at these solutions in terms of the training error: bad training error, better training error, even better training error. The training error always goes down; you just fit more and more. Then why did everybody tell me that this one looks like a better solution? Because you're not thinking about the training error, right? You're thinking about the test error. So, can you guess the shape you're going to see there? Let's see. You didn't like this one, so it's probably going to be high. You didn't like this one either, so it's going to be high.
This one looks better than the one before, so it's going to be lower. This one looks really good, and this one starts to be bad again. So if you look at the behavior in terms of training error, it's down, down, down, down. If you look at the behavior in terms of test error, it's going to go down, down, down, and then up. Why? Because at some point you start to hit the instability. You agree? If you want, you can stop here, because you're just saying: if converging is a good idea, I'll converge, since I already know I converge to the minimum norm solution; otherwise, I'll just stop before. And if you want, the 1980s training of neural networks was basically based on just this empirical observation. (A small sketch of this early stopping recipe follows below.)

Yet out of this discussion you're perhaps left wondering: is this a hack, or something you can ground in some kind of reasoning? We spent quite a bit of time grounding the old business: you can think of it as a spectral filter, you can think of it as constrained empirical risk minimization. This, instead, just looks like a hack, something I wasn't supposed to do: my optimization friends told me I should always run to convergence, but now I don't listen to them and I stop the algorithm early, because it looks like a better idea. So the question left on the table is: can we quantify, can we describe this effect in some meaningful terms? That's what we want to do next, and it's the main thing we're going to do today.

Let me say that this kind of observation is becoming very fashionable today. One reason is that it seems it can shed light on some of the puzzles of deep networks. That's one big reason. When you look at the behavior of deep networks, even if you turn off everything you thought was regularizing, they still don't go nuts. Probably it's black magic, right? Or there's something going on. Maybe the optimization you're using is doing something: it's driving you into some region of the parameter space where things are simpler; it just doesn't let you go anywhere. So even if you're overparametrized, and that's what we've been doing for two days, overparametrized models, the procedure keeps things simple. Now, in the case of least squares, the math can be made simple, so you can really understand what's going on and touch it. We don't really know yet whether this is the case for deep networks. But at least this gives you a refreshing way to proceed: you ask a question, you go to a simple case, you see how it works, and then you try to generalize at least the principle. The proof might be different, but you can even just do experiments to see if it holds. In this simple case, you can have the theory and the experiments; for deep networks, you can do the experiments, but then you have to sweat much more to actually get a theorem, and we still don't know how to do it.

So this whole idea of implicit regularization: one reason it became popular recently is this question of what happens when you use a highly parametrized neural network and you see that it fits the data but doesn't quite overfit. In that sense, notice that this picture is the extreme case: a difficult problem where you don't have a lot of data and there is a lot of noise.
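The recipe just described, descend on the training error but monitor a held-out error and keep the best iterate, can be sketched like this (names mine):

```python
import numpy as np

def gd_early_stopping(X_tr, y_tr, X_val, y_val, t_max, gamma=None):
    # Descend on the training error, but monitor the held-out error
    # and keep the iterate where it was smallest: early stopping.
    n, p = X_tr.shape
    if gamma is None:
        gamma = n / (2 * np.linalg.norm(X_tr, 2) ** 2)  # safe constant step
    w = np.zeros(p)
    best_w, best_err, best_t = w.copy(), np.inf, 0
    for t in range(1, t_max + 1):
        w -= gamma * (2.0 / n) * X_tr.T @ (X_tr @ w - y_tr)
        err = np.mean((X_val @ w - y_val) ** 2)
        if err < best_err:
            best_w, best_err, best_t = w.copy(), err, t
    return best_w, best_t
```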
Back to the shape of the curve: if you take a problem where you have two classes and they're decently separated, you don't see this shape. You see something that looks more like this: you have an elbow, and then it becomes completely flat, because it's still true that this point is not the same as this one, and that's where you want to stop. So the shape I was showing is the worst case scenario; in practice you might see something different. But the idea is still the same: the behavior of your optimization for training and for test is different. And this is a completely new idea from an optimization perspective. In optimization, you just look at the final objective and try to optimize it. Here we're doing something weird: we define an objective and optimize it, but then we actually judge by a different error, the test error instead of the training error. That's completely new. It's a statistical view on classical optimization. As I'll comment in a minute, the idea itself is not new: it was observed more than 50, 60 years ago. That an optimization solver has a self-regularizing effect, that the number of iterations itself somewhat governs the simplicity of the solution, is an old observation; in the case of linear systems, it goes back at least to the 1950s. And the idea that you can stop early has also been observed before.

Yes? The training error you don't look at. I mean, life is a bit more complicated, but the pedagogical answer at this level of precision is that you ignore it. You use the training data to run the algorithm and get the w's, but as a figure of merit you only look at the test error. In practice you probably want a bit of both: you want some signal from the training error too, but that's more of a hands-on, tricks-of-the-trade story. At the first level of approximation, that's what you want to do. Really, that's what learning is about all the time, right? You take data, and you want to do well in the future. Here I'm just asking: why on earth would you keep on going here? There are some little complications I'm hiding, but at the first level, that's the story.

Yes? To overfit is a tricky word. So, gradient descent is going to go here, right? Yes. Now, do you like that solution or not? Well, it depends. If it's stable, I'm fine. If, you know, Python tells you this matrix is completely ill-conditioned, you hit instability; that's what I mean. Gradient descent goes there, and if that's a good place, fine; but if it's a bad place, you need to fix it. That's all I'm saying. And the question is: how did we fix it before? We introduced lambda, and we let lambda do something for us. Now, from this drawing, it feels like the number of iterations is doing something too. Is it doing the same thing lambda was doing? Can we find a relationship between lambda and t in some sense? That's what we want to do.

OK, let me see what else I have here. I just want to remind you of a couple of things we're going to use. I want to go slowly enough to be sure everybody is OK with this, because it's the only tool we're going to use.
If you take this with a smaller than one, what do you get? For |a| < 1, the geometric series sums to one over one minus a: the sum over j of a^j is (1 − a)^(−1). Half of us never remember it, I always forget it myself, but that's the one thing we're going to use, so let's put it there once. Now, what if instead of a I write 1 − b? Then, trivially, the sum of (1 − b)^j is 1/b. The thing that is a bit more interesting is that everything still works if, instead of numbers, you take matrices. If you take a matrix A with norm smaller than one, the sum of A^j is (I − A)^(−1). Similarly, if you take a matrix B such that I − B has norm smaller than one, the sum of (I − B)^j is B^(−1). It's called the Neumann series: the geometric series for numbers also works for matrices.

OK, let's take this iteration and manipulate it for a second. I'm going to reorder the terms with w, so I get w_t = (I − (γ/n) X^T X) w_{t−1} + (γ/n) X^T y: all the terms with w together, plus the (γ/n) X^T y term. Check me, because I typically make mistakes in this kind of computation. And the claim is that if you look at this, and if you play a bit with the recursion, you can convince yourself that the solution at step t can be written in a particularly simple way: it's a sum of powers of this matrix applied to this vector,

w_t = (γ/n) Σ_{j=0}^{t−1} (I − (γ/n) X^T X)^j X^T y.

How do you prove it? By induction, if you want. You just take this expression for t − 1 and plug it in here; this factor raises the power by one, and this term gives you back the index zero. I skip it because I don't want to bore you with the calculation; it's not interesting, but if you want to fill in the proof, that's all you have to do. It's very simple; there's nothing peculiar. I just write it so you can convince yourself that it's not inconceivable, and the proof takes one second to write.

All right, so let's stare at this. Does it look like anything written on the board? Almost, yes: this guy, thanks. So, does it look familiar? Yes, I actually wrote something very similar here. What's the main difference? Well, everything is applied to this vector, fine. But the only other difference that really matters? The range of the sum: here it goes from zero to infinity, whereas here it goes from zero to t − 1. Think a second about that. First of all, if I let the sum go from zero to infinity, which I could, by letting the iterations go to infinity, what do I get? By the Neumann series, this whole thing becomes the inverse of (γ/n) X^T X, that is, (X^T X)^(−1) times n/γ, and the n/γ cancels against the γ/n in front. So I get (X^T X)^(−1) X^T y. Can you see from back there? No, right? The point is: if t is large, this thing here becomes the inverse, the γ/n factors cancel, and this is what I get. I'm not done yet; that's the next comment. I'm cheating: I'm giving you the result first, and then we ask why it's true.
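A quick numerical check of the closed form (a sketch; here the factor 2 from the gradient is absorbed into gamma, matching the board's convention, and the toy data is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 10
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
gamma = n / np.linalg.norm(X, 2) ** 2    # so that (gamma/n)*||X^T X|| <= 1
t = 50

# the recursion: w_t = (I - (gamma/n) X^T X) w_{t-1} + (gamma/n) X^T y
w = np.zeros(p)
for _ in range(t):
    w = w - (gamma / n) * X.T @ (X @ w - y)

# the closed form: w_t = (gamma/n) sum_{j=0}^{t-1} (I - (gamma/n) X^T X)^j X^T y
M = np.eye(p) - (gamma / n) * X.T @ X
S, P = np.zeros((p, p)), np.eye(p)
for _ in range(t):
    S, P = S + P, P @ M
print(np.allclose(w, (gamma / n) * S @ (X.T @ y)))   # True
```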
How do you have to choose gamma to be sure this happens? Well, I told you earlier there is this rule to choose the step size, blah blah blah; but you can tell me, and that probably answers the question. Everything I said holds provided the series converges. And it's written here: what do you need for the series to converge? You need B to be the right kind of matrix. What's B here? This. So you need the norm of this to be smaller than one, which, if you check, just means that you have to look at the largest eigenvalue of this and choose the step size accordingly. And this gets you back to the classical condition on the constant step size for gradient descent, the one involving the second derivative, just recovered in a more algebraic way. So the classic choice, a constant step size, is enough. And don't be fooled by this n: it is here only because we were minimizing the average error over the data; it has nothing to do with the step size, it's a normalization of the data. So you're right: this is not true for an arbitrary step size. I have to choose the right one, and this gives back the choice we had implicitly assumed.

OK, so this is the large t case, for the proper choice of the step size. Is that surprising? Not really. The derivation is perhaps a little interesting because it's new, but we just rediscovered that when t goes to infinity, gradient descent converges to the minimal norm solution of our system, whatever it is. From that perspective, we didn't find anything particularly new.

What happens if we take t not that large? Then what do you do? Let's do the numbers first; help me out. I don't have the full series anymore; I have the partial sum. If I'm not mistaken, the sum from j = 0 to t − 1 of a^j is (1 − a^t)/(1 − a). And with a = 1 − b, you get (1 − (1 − b)^t)/b. Let me write it like this, because this form works fine with matrices: if I make b into a matrix B, I get the same expression with matrices. So w_t, if I now use the partial sum, can also be written explicitly, and it becomes the inverse times identity minus this part; trust me, or, if you're following, check in case I make mistakes:

w_t = (X^T X)^(−1) (I − (I − (γ/n) X^T X)^t) X^T y.

Just notice that, again, the γ/n in front goes away because of this inverse, and the γ/n stays inside here. That's what you get. What happens if you take t equal to one? Then this matrix cancels against its inverse, and you roughly get this, up to a rescaling: something proportional to X^T y.

All right. Let me summarize the facts so far. We did a bit of series and sums, but now we have some facts and we can look at what they say. If you do gradient descent, you're effectively computing a series in a recursive way. By running the recursion, you're computing this series. Actually, not quite the series: when you stop at time t, you're computing the partial sum up to the t-th term.
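One can sanity-check the partial-sum formula and its two limits numerically (a sketch; X^T X is taken invertible here so the inverse exists):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 10
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
gamma = n / np.linalg.norm(X, 2) ** 2
A = (gamma / n) * X.T @ X                  # eigenvalues in (0, 1]
XtX_inv = np.linalg.inv(X.T @ X)

def w_t(t):
    # partial geometric series: (X^T X)^{-1} (I - (I - A)^t) X^T y
    return XtX_inv @ (np.eye(p) - np.linalg.matrix_power(np.eye(p) - A, t)) @ X.T @ y

print(np.linalg.norm(w_t(1) - (gamma / n) * X.T @ y))   # t = 1: ~ X^T y, rescaled
print(np.linalg.norm(w_t(10000) - XtX_inv @ X.T @ y))   # t large: the inverse
```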
If t gets very large, what you're computing is the inverse, which you kind of already knew from the other reasoning, but here you see it from another perspective, and this also tells you how to choose the step size, in case you didn't remember it from the previous result. If you don't take t to infinity, this is what you get, for any t. In particular, if t is equal to one, you get something proportional to X transpose y. And notice that as you take further terms, all the terms are smaller than one, so they are corrections: you add a little more, a little more, a little more.

All right, now look at this. This is the other guy we introduced. If you take lambda small, what happens? This term is negligible, so you essentially get the inverse of X transpose X. T large, lambda small: they smell the same. What if you take lambda large? At some point you can basically ignore this term, and, up to the one-over-lambda factor, who cares, you get something proportional to X transpose y. So at least in the two limits, you see a complete analogy: t large is basically like lambda small, and t small is basically like lambda large. Agree?

When you build the ridge regression solution, you move lambda, and you don't compute one solution: you compute a path of solutions, what is sometimes called a regularization path. As lambda gets smaller, you get a more faithful approximation of the inverse and of your data; for lambda very large, you move away, you get a very nice condition number but a very poor approximation of your data. And what I'm trying to convey here is that t, the number of iterations, through this series profile, does pretty much the same thing. In some sense, lambda roughly behaves like one over gamma t, give or take; let me write it here, it's the perfect place. So large t, small lambda, and the other way around. So gradient descent, along its iterations, its optimization path, is creating a sequence of solutions that behaves very similarly to ridge regression, to Tikhonov regularization. One is parametrized by lambda, the other by t. If you want, you can actually push this into theorems; we're not going to do it, we basically stop here and just make a few comments about it.

So if, on the one hand, just looking at the plots where you fit the data, you could feel this was a reasonable hack and a pretty reasonable fix, now you start to see that there seems to be a bit more going on. It really starts to feel like you're doing something very similar, at least in the case of least squares. That's why we do least squares: everything we said here, the proof relies heavily on least squares, and squares are nice because the gradient is linear, so you can do linear algebra, use eigenvalues and all that kind of stuff. If you do nonlinear stuff, the business gets a bit more complicated.

OK, yes? Good point. What we did last time was write the Tikhonov solution through the eigendecomposition, and I used the word filtering.
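The analogy can also be poked at numerically. A sketch comparing the gradient descent iterate at time t with the ridge solution at lambda of order 1/(gamma t); this pairing is heuristic, read off the two limits above, so expect the two solutions to track each other rather than coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
gamma = n / np.linalg.norm(X, 2) ** 2

def gd(t):
    w = np.zeros(p)
    for _ in range(t):
        w = w - (gamma / n) * X.T @ (X @ w - y)
    return w

def ridge(lam):
    # minimizer of (1/n)||Xw - y||^2 + lam*||w||^2
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

for t in (1, 10, 100, 1000):
    lam = 1.0 / (gamma * t)        # heuristic pairing: lambda ~ 1/(gamma t)
    print(t, np.linalg.norm(gd(t)), np.linalg.norm(ridge(lam)))
# the two norms stay in the same ballpark and grow together as t
# increases (equivalently, as lambda decreases)
```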
The one-over-s of the inverse was replaced by this expression here. And now you can play the same game, which is just what he said: whenever you see X, plug in the singular value decomposition and see what happens, and this is what you get. This is the filter of this algorithm. And it's a similar shape; they're all similar. So if you take this perspective, this is a filter, and it has a very similar shape to this one. You can draw them, it's just one dimension, and they look alike. They're both like low-pass filters: they both like large eigenvalues and dislike small eigenvalues, and they both have a tuning knob, the size of the window of your filter; in one case it's lambda, in the other it's t. From this perspective, they're extremely similar. So something that at the beginning of the story looked a bit like a fix now looks like a pretty legitimate alternative. I skip a bunch of slides where basically we do the computations that are on the board; they're there. I chose to do it on the board because it's a bit slower.

That's a good point to start drawing a few conclusions. The promise at the beginning of the class was: we're going to toss ridge regression away and get something very similar, but different in some way. Now we have it, and we can compare. You can say which one you like better, and why you'd choose one or the other. What do you think? Mission accomplished: we have it, now we have to choose.

Oh, why is that not OK for lambda? That's a good point, and it's a great track; let's think about it for a second. Let me repeat what he said, and let's think about it together. He's saying: lambda is a free parameter and you have to choose it, whereas t, according to this story, you can fix by just monitoring the test error. I agree with part of it and disagree with part of it. The part I agree with is that you can use the test error for t; but I don't see why you shouldn't be able to use it for lambda as well. You can also choose lambda this way. What do you do? You solve this problem for every lambda you want to try: you take a grid of values, lambda one, lambda two, lambda three, and for each of them you check the error on the test set. If you plot against one over lambda, you should expect a very similar behavior. You can view it from the matrix approximation point of view or from the fitting point of view, and you get the same story. So yes, it's true that here we discussed the choice of t more than the choice of lambda, but according to this, lambda and t correspond to each other. They are both free parameters, but free parameters of the same kind, and we can choose them in the same way. All right, so I would put that on the similarity side of the story: they are similar filters, they have roughly the same shape, and the regularization parameter can presumably be chosen in a very similar way.

What else? All right, why? But why can't you do it here? If you randomly select a subset, you can make it small enough that it works, right? You're on the right track, but for the wrong reason.
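Before the computational comparison, the two filters can be written down and compared directly, as functions of an eigenvalue s (a sketch; gamma is assumed scaled so that gamma*s <= 1):

```python
import numpy as np

def ridge_filter(s, lam):
    # Tikhonov: replace 1/s with 1/(s + lambda)
    return 1.0 / (s + lam)

def gd_filter(s, gamma, t):
    # gradient descent after t steps: (1 - (1 - gamma*s)^t) / s
    return (1.0 - (1.0 - gamma * s) ** t) / s

s = np.array([1e-4, 1e-3, 1e-2, 1e-1, 1.0])   # eigenvalues, small to large
print(ridge_filter(s, lam=1e-2))
print(gd_filter(s, gamma=1.0, t=100))
# both behave like 1/s on large eigenvalues and saturate on small ones
# (at ~1/lambda and ~gamma*t respectively): low-pass filters with a knob
```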
What else? All right, why? But why can't you do it here? If you randomly select a subset, you can make it small enough, right? You're on the right track, but for the wrong reason. So I think that from the modeling perspective it's not very easy to see the difference between these two guys: you choose the parameter in the same way, and they are filters of the same shape. As I told you, there is no natural interpretation of this one in terms of minimizing an empirical risk plus a penalty, or of deriving it from a Bayesian perspective as a likelihood-plus-prior kind of thing. I'm not saying there isn't one; I don't know how to do it. It's a question for you: if you like this but you're a Bayesian, try to see if there's a way. Other than that, they smell very similar. But if you start to think in terms of computations, they don't. So the first observation is: what is the cost of this? Well, roughly p squared times n to build this matrix and p cubed to invert it, and if you're smart, you can also consider the other version where the roles of n and p are flipped. So that's the time, and that's the memory: that's the cost of this one. Now, remember that you never know lambda: you have to do what Davide suggested, try this for different lambdas and pick one. Here the story takes a slightly different flavor. In principle, the simplest accounting is that you multiply this cost by the number of lambdas you try: if you use a fast solver for the linear system, say some matrix decomposition, you have to redo it for every lambda. You can instead do one eigendecomposition and then avoid the extra cost per lambda, but the eigendecomposition itself is of the same order with a much higher constant, so in practice it's not clear which is best. What about this one? Let's see. If I start from here, I'm crazy: this is an equality, but computationally it means forming matrices and taking their powers, and a back-of-the-envelope computation says you don't want to do that; it's cubic, and you'd do it many times. That expression was just useful in an argument. You should definitely start from here instead. How much does this cost? This is a matrix-vector multiplication, so it costs essentially the size of the matrix, np. The result is a vector, and the second multiplication has the same cost, and that's it; that's one iteration, and you do it for as many iterations as you want. So this guy had that cost, and the new guy is npT in time and np in memory. Memory-wise they are the same, but the time cost is very different: one is just the cost of matrix-vector multiplications. So there are two observations. First, the cost of this can be much smaller than that, as you can see just by comparison. Second, matrix-vector multiplications are exactly the thing for which parallel and distributed computation makes life easy, for which GPUs make life easy. You never have to store the whole matrix.
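Here is what "never store the whole matrix" can look like, a minimal sketch (the function and its names are mine) of the block-wise computation discussed next: the least-squares gradient is accumulated one chunk of rows at a time.

```python
import numpy as np

def gradient_in_blocks(X, Y, w, n_blocks=3):
    """Least-squares gradient accumulated block by block.

    Each block of rows contributes X_b^T (X_b w - Y_b); with memory-mapped
    or streamed data, the full matrix never has to sit in memory at once.
    """
    grad = np.zeros_like(w)
    for Xb, Yb in zip(np.array_split(X, n_blocks),
                      np.array_split(Y, n_blocks)):
        grad += Xb.T @ (Xb @ w - Yb)
    return grad / X.shape[0]

# one descent step built only from block-wise matrix-vector products:
# w -= gamma * gradient_in_blocks(X, Y, w)
```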
You can do, in a sense, what he was saying before: you can chunk the data into pieces, compute a bit of the gradient, and keep going. Notice that you could also do something like this for the closed-form solution, but then you would be computing different solutions and you'd have to mix them up afterwards; that's why I wasn't sure what you were referring to. Here you do something different: you decompose the computation. You look at, say, a third of the data and compute the gradient for that third, then the gradient for the second third, then the last third, and then you update. You have to swap memory to do this, you have to read things into memory, but you can do it. In principle, you can solve this for arbitrarily large matrices by splitting the computation into blocks, whereas there you typically have to store the whole matrix in memory somehow. So the computations of this algorithm are better, and they only entail matrix-vector multiplications, which you can easily parallelize. From a strictly numerical point of view, that's the main difference, and it really is the big difference between them: one can, in this sense, scale; the other one basically has problems. That was a very good hint. There is another difference which is a bit more conceptually interesting. This computational cost depends only on how many points you have. Whether the problem is easy or hard, whether the data are noisy or not, does not inform how much computation you do: you just do this amount of computation, and that's it. How about here? Well, how do you choose T? By looking at the data. If the problem is easy or hard, if the data are noisy or not, presumably the minimum of the test error is going to move. If the problem is easy, you might get there very quickly. If the problem is very noisy, you have to stop early, because otherwise you start to fit the noise. If the problem is not so noisy but quite complicated, then you have to keep going: you can find a good solution, but it's going to take a while. So conceptually there is something here that I find very interesting, which is that the amount of time you spend on a data set does not depend just on the sheer size of the training set, but on the learning problem you're trying to solve, on the test side. And this somewhat opens a window, because all of a sudden you can look at optimization as a statistician, and ask about the role of these descent parameters in a kind of different way. You can now ask new questions.
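As a concrete sketch of "choosing T by looking at the data" (hypothetical names, hold-out split assumed): run gradient descent once and keep the iterate with the lowest validation error, so the computation spent adapts to the problem.

```python
import numpy as np

def gd_early_stopping(X_tr, Y_tr, X_val, Y_val, gamma, max_iter=5000):
    """Gradient descent on least squares, returning the iterate with the
    smallest validation error along the path (early stopping)."""
    n, p = X_tr.shape
    w = np.zeros(p)
    best_w, best_err, best_t = w.copy(), np.inf, 0
    for t in range(1, max_iter + 1):
        w -= gamma * X_tr.T @ (X_tr @ w - Y_tr) / n
        err = np.mean((X_val @ w - Y_val) ** 2)
        if err < best_err:
            best_w, best_err, best_t = w.copy(), err, t
    # best_t comes out small on noisy problems and larger on hard clean
    # ones, per the discussion above
    return best_w, best_t
```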
Yes? "The choice of lambda is connected to how noisy the problem is." Yeah, but that only plays a role statistically. So what Anthony is saying is: T depends on the condition number, lambda depends on the condition number, so what's going on here? If you think about it, you're correct, and there are two comments. One subtlety is that, in some sense, what matters is not the condition number of the data itself, but how it connects to the future data: here I'm not choosing the parameter from the condition number of the data, what really matters is the test error. That's the first observation, and it's a bit subtle. The one I'm really pointing at, though, is this: you don't know the condition number, and to choose the right lambda with the test set, you have to solve the problem many times. So yes, what you're saying is yet another element of similarity, but to discover the right lambda you have to do a lot of computation, whereas here, in some sense, it's naturally embedded in the way you explore the space. So it's a very computational way of doing regularization and statistics. It's hard here to say what is the statistical part and what is the optimization part: optimization is doing statistics here, it just explores the space in a certain way. They're completely mixed up, okay? That's what I'm trying to say: T controls at the same time the numerical stability and the complexity of your model. "Given the specific shape of that curve, couldn't we find its minimum by some interpolation?" Yeah, but you don't know the shape. There are two things here. If you knew a lot about the problem, you could do analytical stuff, for example try to derive lambda; but in general you can't, and that's the hard part of learning, right? The typical thing you do is split the data in two and use half of it to make a guess, because it's very hard to guess a priori where the minimum is. If you could do that, life would be very good. That you can do, yeah, that's true. But what's your point, what do you want to get to? Ah, I see. So what you're saying is that here I put down lambdas without really saying how I'm going to search over them, but I could search in a reasonable way. Rather than doing a plain grid search, I could take, say, 10 points, see where the good region is, then take 10 points there: if you want 100 values, you don't just take 100 values in a row, you go coarse to fine, and if you assume the curve is roughly convex, you can go and look in the middle. Still, even if you do that, which means the number of lambdas is not so large, you have to multiply the full cost by that number, and if you compare this to this, it's still much more: this one is linear in n, linear in p, and linear in T. What he's pointing at, though, is something that could be interesting, in theory at least, because in practice I've never seen it matter: this is a discrete regularization. T goes one, two, three, four, five; you cannot decide how to discretize it. With lambda, you can decide how to discretize, for example on a logarithmic scale, and so on and so forth. In practice this doesn't seem to be an issue, but in principle, stuff could happen between two steps. So there is a difference; in that sense, you're absolutely right.
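For comparison, the lambda side of the same story, as a sketch (all names mine): a coarse-to-fine search on a logarithmic grid, where every candidate lambda costs a fresh linear solve.

```python
import numpy as np

def ridge_search(X_tr, Y_tr, X_val, Y_val, n_coarse=10, n_fine=10):
    """Coarse-to-fine search for lambda on a log scale; unlike early
    stopping, each candidate requires re-solving the ridge system."""
    n, p = X_tr.shape
    A, b = X_tr.T @ X_tr / n, X_tr.T @ Y_tr / n

    def val_err(lam):
        w = np.linalg.solve(A + lam * np.eye(p), b)
        return np.mean((X_val @ w - Y_val) ** 2)

    coarse = np.logspace(-6, 2, n_coarse)          # a pass over decades
    lam0 = min(coarse, key=val_err)
    fine = lam0 * np.logspace(-0.5, 0.5, n_fine)   # refine near the winner
    return min(fine, key=val_err)
```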
Yeah. So the next question, if you want, is this: from this perspective it's very natural to see why you would want to do stochastic gradient descent, because, as I said, if you have a big, big matrix, you can decompose it into two, three, four blocks. What you do then is take one third of the data and compute a piece of the gradient, then the second third for another piece, and once you've computed all the pieces, you update your solution. If you have a huge data set, that's kind of annoying, because you have to wait to see all the data before you make any guess. So the natural thing, at the extreme, is stochastic gradient descent: every time you see a point, you update your solution. The intermediate case is the mini-batch case, where you say: I wait to see a few points, then update, see a few more, then update; which is the one people typically want to use. While the full-gradient case is very well understood, this case is much less understood, but roughly speaking the same phenomenon occurs. In this story I'm emphasizing the role of the number of iterations, but when you look a bit more closely, what really matters is the interplay between the step size and the number of iterations. A finer analysis shows that, rather than lambda being one over T, what really matters is one over T times gamma: in some sense, the distance you cover is what matters. Here I chose the step size to be constant, so everything is governed by the number of steps, but in something like stochastic gradient descent you can see the same thing happen, and you can play with both. So, long story short, the same phenomenon occurs, but it's a bit trickier.
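A minimal mini-batch sketch for the least-squares case (batch size, names, and schedule are mine): batch_size=1 is the pure stochastic extreme, batch_size=n recovers full gradient descent, and, in the heuristic reading above, the product of step size and number of updates plays the role of one over lambda.

```python
import numpy as np

def minibatch_sgd(X, Y, gamma, n_epochs=5, batch_size=32, seed=0):
    """Mini-batch SGD on least squares: update after every small batch
    instead of waiting for a full pass over the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for idx in np.array_split(order, max(1, n // batch_size)):
            Xb, Yb = X[idx], Y[idx]
            w -= gamma * Xb.T @ (Xb @ w - Yb) / len(idx)
    return w
```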
All right, so we're basically done; I just want to make a few remarks, and you should ask questions if you have any. One thing, following up on her question, I think I have it somewhere here: what about other kinds of optimization? She asked about stochastic gradient, which is the obvious one in the large-scale regime, but as you know, people also talk about accelerated gradient, and you might remember things like conjugate gradient and so on. Can you use them? Sure you can. Do we know their test behavior? Mostly not, because we mostly study their training behavior. For some of them we do: for conjugate gradient we have an idea of how it behaves, and the spoiler is that, to some extent, the regularization parameter goes roughly like one over T squared, which means you can stop a bit earlier. Anyway, the take-home message is that there are many, many optimization techniques, and we understand the test behavior of very few of them. Least squares is one; the other cases are more complicated. For stochastic gradient we know something, but there is only a handful of cases. Similarly, if you try to generalize this, you can ask: what about different loss functions? What about different norms? What about different function classes? And again, only a very sparse subset of these combinations has been explored. To me this is interesting, and I thought to propose it to you especially because you've been seeing a lot of material about deep networks, because I do think this is a simple case where you see a clear phenomenon and you can explain it well. Now, if you ask me, can you prove that this is really what underlies deep networks? I would say no. But if you ask me, can you make experiments to support this conjecture, and can you convince yourself that this might be at least one ingredient? I would say yes. And it's not just experiments: at least in the simple case of linear, shallow networks, the zero-hidden-layer network, which is what I did here, I know it's true. So in some sense it's a case where you see something you didn't know before and you can put it as a variable in the game. And again, least squares makes everything easy because it boils down to linear algebra; through the entire course I play this game just because it's easier, basically. Let me go back to my list of stuff, a few comments here and there, and stop me if you have anything. This idea has different names, which is another proof of this constant reinventing. Gradient descent is what pretty much everybody would call it. It was proposed back in the 50s as a way to do regularization, and there the name is the Landweber iteration; people in statistics and machine learning have been calling it L2 boosting, drawing a connection with boosting algorithms, which you can look into if you're interested. So the idea of using iterations in this way is old, and the idea that you can stop training early to avoid overfitting has been around at least since the 80s, in particular in the neural networks literature. The classical name for everything I'm talking about is not implicit regularization; that's a fancy machine-learning-2017 name. Iterative regularization is the classical name, and as I say here, there are books about it. So before we invent, we should check that we are not reinventing, because there is a huge literature. For example, there is a literature on nonlinear inverse problems, which are similar to neural networks in the sense that they no longer have a linear dependence on the parameters, and you can see how far you can go with that. This we already discussed. To me, even more than this, my personal motivation for liking this is really the following: in this algorithm, I don't split statistics and computation, they're intertwined. One parameter controls at the same time the statistical accuracy and the numerical cost, the time cost, of my algorithm. It's elegant from a theory point of view, and it's practical, because it means I can do the minimal amount of computation needed to achieve a prescribed statistical accuracy. It also justifies plots you often see, with training time on one axis and training and test error on the other. Why those? By themselves they're not justified by anything, but if you know that training time, the number of iterations, essentially controls the complexity and stability of your model, then they make sense: it's what you do with lambda, with the bias-variance trade-off, that kind of stuff. And recently, other names have appeared: I've been using computational regularization, other people algorithmic regularization, to emphasize that this kind of regularization is algorithmic in nature. And that's really more or less everything I wanted to say. I did not discuss extensions to other classes of functions, but it's more or less the same story as yesterday: you can trivially extend everything I said to nonlinear features or kernels. You cannot do the same for neural networks, because the linear algebra just doesn't go through; you can do, as we said, SGD. And the big game in neural networks right now, for a lot of people, is the following. This guy converges to the minimal L2-norm solution. Can you build an algorithm that goes to the minimal L1-norm solution, or to the minimum of some other norm? Yes, you can; me and a few other people have been working on this. In that case, you fix the objective, you fix the norm, and you look for the algorithm. In deep neural networks you play a strange game: you fix the algorithm, you fix the objective, and then you try to find which norm you're implicitly converging to. That's how people are thinking about using this to understand what the networks are doing. Instead of saying, I know the norm, I know the objective, give me the algorithm, you say: I know the algorithm is SGD, I know the objective is the training error, so which norm is this guy implicitly minimizing? And then people go around chasing that norm: what is the norm? It's not a bad idea, because it gives you a way to think about which norm you're implicitly controlling along the optimization path. It's just very hard, because the dynamics are nonlinear, so people typically start with linear networks; and already for multilayer linear networks, things are not that clear.
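The minimal-norm claim is easy to check numerically; a hypothetical sketch in an overparametrized setting (p > n, names mine): gradient descent on least squares started from zero lands on the minimum-L2-norm interpolant, which is the pseudo-inverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                   # overparametrized: many interpolants exist
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# plain gradient descent on least squares, started from zero
w, gamma = np.zeros(p), 0.01
for _ in range(50_000):
    w -= gamma * X.T @ (X @ w - Y) / n

# the minimum-L2-norm interpolant is the pseudo-inverse solution
w_min = np.linalg.pinv(X) @ Y
print(np.linalg.norm(w - w_min))  # ~ 0: GD implicitly selects the min-norm solution
```

Starting from zero matters: the iterates never leave the row space of X, which is exactly why it is the L2 norm, and not some other norm, that ends up being controlled.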
Anyway, this is pretty much everything I wanted to tell you today. A lot of it is discussion, and hopefully it gives you something to think about. What we want to discuss tomorrow is this: so far we mixed time and statistics, but we did not mix memory and statistics, and in the large-scale scenario that's a real issue. Even if you don't run out of time, you run out of memory, and that's a big problem. So it would be very nice to have something where the amount of memory you use is tailored to how hard the problem is. That's what we'll try to discuss tomorrow. All right, I'm all set.