So, how much did you hear of what I said? Anyway, let's start. It's about machine learning, and you might say: again? So we talk about old stuff and try to pretend it's new? No. I'm going to try to show you some simpler ideas and show you how they still matter, how they connect machine learning not just to itself but to a lot of other fields, how ideas that we use now were not born after 2012 but actually go back maybe 50 or 70 years. The emphasis is more on algorithmic aspects than on statistical or probabilistic aspects. I'm very happy to discuss those offline; I'm going to be around for three days. But the emphasis here is not on statistical bounds. They don't even appear; you'll just see at some point why they would appear. The emphasis is on principled learning algorithms. We start from the most classical one, which is empirical risk minimization. We go around it a few times to understand what people mean by things like regularization and stability, and how they connect with classical notions of stability. Then in the next two days we're going to discuss a couple of more advanced perspectives on learning algorithms. Tomorrow is going to be about the idea that when you optimize, you are already enforcing some kind of stability without knowing it, an idea that is fashionable today and sometimes called implicit regularization, but one you can trace back at least to the 1950s in the solution of integral equations. On Wednesday we're going to connect one of the classical ideas, principal component analysis and the singular value decomposition, to much more modern stuff like sketching and random projections, which are the tools we have today to scale algorithms when data sets don't fit in memory. The classes are not that dense; today is slightly more dense. As I discussed with Antonio, he confirmed that your background is quite diverse.
Hopefully the mathematical level should be comfortable for everybody, but the conceptual level might not be. So stop me any time. There is more material than what I necessarily have to present; we can skip here and there. If we make this a bit more interactive, you'll survive your week a bit better. All right. The usual motto is that machine learning is about taking a system and, rather than providing rules, providing data to solve a task, together with some principles. And there are many, many perspectives on this. You can be more probabilistic, put probability at the center of the story and take a more Bayesian point of view. You can ignore probability altogether, view things as purely programming and scaling, and look at computational tricks. You can look at dynamical systems. There are many points of view. We take the point of view that, over the last 20 to 25 years, has emerged as one of the main frameworks to understand the properties of learning algorithms, what is called statistical learning theory, where the emphasis is a little more on frequentist perspectives. And I like this framework because, again, I feel it connects to many other realms of applied math and the applied sciences in general. So that's what we're going to discuss. We're going to spend a few slides just setting it up, and you'll see that I'm going to simplify it a bit; it's basically one slide. It's a conceptually dense slide, but it's still just one slide. Before doing that: I'm going to mostly restrict myself to the realm of supervised learning, mostly because we don't really know what we're doing when we're not doing supervised learning. Supervised learning is where we have algorithms and theory. Everything else is where we have solutions to problems that we cannot really define. They're extremely useful in practice, and you have to learn how to use them to solve practical problems.
But if you have to start a lecture, it's problematic to start from things we don't know how to define. Clustering is a fantastic example: you know what it means, but you don't know how to define it. So here we start from the one place where we have very clear ideas about what we're doing. And just to give you a couple of examples to set your mind before I kill you with some equations: text is one possible realm. We're going to look at problems where you have an input and an output. In this case the inputs are emails, and the output is whether each one is spam or not; it's a label, 0 or 1. The idea is always that you want to convert the inputs into vectors, or something similar, something a machine can manipulate, and then you want to build a function that goes from the input to the output. You're provided a set of examples and you want to find a good rule. This is an example where you can start to see why we call it learning. The typical, pre-learning way of solving this problem was to write rules: if it contains "Viagra", most likely it's spam; if there is a king who wants to give you five billion euros, probably it's spam. The problem is that these things change, they adapt. So it would be very nice to just have a setup where, every once in a while, you click and delete a few emails, they go in a folder, you keep a few, randomly anonymized, and you put them in the right folder. And then, overnight while you're sleeping, the system says: okay, let me take a look at these two folders and update my rules. But they're no longer hand-written rules, because now they look at the data. By deleting emails, essentially, you provide supervision. You're like a teacher that says: I believe this is spam, I believe this is not spam.
That's why we call it learning: there is a feedback mechanism where somebody provides labels for the data, and then you close the loop overnight by training your machine. The nice thing is that you can now take emails and replace them with whatever else, and this still works. For example, say you want to read the license plate of a car automatically. You have a camera that scans the characters of the plate one by one, and it has to decide whether a character is, say, a 0 or a Z. You collect all the characters. In this case you have not two possibilities but many, but the idea is the same. What's the input? The input is an image, and an image you can trivially think of as a vector if you unroll the pixels; the output, to make it easy, is just one of two labels, 0 or Z. So far so good. All of these problems can now be cast in a setting where you have pairs X and Y: an email and whether it's spam or not, another email and whether it's spam or not, or an image and whether it's a 0 or a Z. I don't use special vectorial notation; you kind of have to figure it out from context. But usually Y is going to be a number, 0, 1, or a real number, and X is typically a vector or some more exotic, complicated object. The main point is that, at this level of generality, this is a problem that appears in all kinds of realms: in statistics, because machine learning at the end of the day is 50% statistics, but also in function approximation, which you find in many different fields. And one thing that matters even at this level of generality, where you haven't defined anything yet, is that typically in all these domains X is not one or two numbers. It is a lot of numbers. If you take an image, even if it's just 20 by 20, which is a ridiculously small image, that's already 400 pixels.
So it's going to be 400-dimensional, and 400 is a crazy number: almost everything we know about building functions and understanding geometry breaks down in 400 dimensions. The curse of dimensionality, that's a technical term you can learn. In the case of text, life could be a bit easier, but the dimension is still large. So we love to do lots of little drawings, because we have two-dimensional boards, but they're totally misleading most of the time. You have to peek into a high-dimensional world you cannot look at, and find principles for navigating it in a meaningful way. That's one of the new things. Another new thing is that we have to handle this computationally. It's not just "I'm going to make a model and hope for the best". I said machine learning is 50% statistics; the other 50% is "I don't know what this object is, but I have to manipulate it efficiently". Literally half of the people doing machine learning are concerned with efficient computations. And this is partially because now computation really is an issue: if you have 20 points in one dimension, you can do whatever you want, but if you have a million points in 10,000 dimensions, you can't. You have to do something efficient, you have to worry, and you might end up working with a model that is not the one you want, but the one you can solve. All right, these are just two among endless examples you can think of to start a company and become rich. And this is just, as I said, the wrong but necessarily low-dimensional picture, where the point is: this is x, this is y, these are the points you have, with both the x-coordinates and the y-coordinates; that's what you have today. And the emphasis here is on the fact that fitting these points might be something you want to do, but it's not the key.
The key is that you're going to be given a new email and you have to say whether it's spam or not; a new image and you have to say whether it's a 0 or a Z. So the emphasis is not purely on fitting or interpolating the data: you have to extrapolate, to do inference. Learning, at heart, is the problem of using observations today to make statements about the future. If you're a physicist, this sounds vaguely familiar. The main difference here is that you give up finding a very detailed description of your model; you say: I'm ignorant about it, but I have a lot of data, and I'm going to fit some flexible statistical model. You give up some explanation and interpretability on one side to find something that works well, with all the tricky aspects this might imply, because you still have to be sure it doesn't just work on this data; it has to be meaningful for all the data. That's basically the short story, in a cartoonish way: you're given data, inputs and outputs, you want to find a function, and the function should be good not on the data you already have, but on future data. That's it. Now, this level of precision is not enough to distinguish learning from a bunch of other things you find in the literature. For example: what's the nature of this data? How can I even hope to do this? The data must be related somehow. The data I see already must be related to each other: if you take a page of a book and the first letter of the book as your input-output pair, good luck with that. And the data that arrives later must be related to what you see today, because otherwise the hope that you can make a prediction about the future is completely meaningless. So this makes sense, but not really: you cannot hope to do anything here if you don't make some assumptions on the process underlying your data collection mechanism.
So what you need to do is say: okay, my inputs and outputs are related somehow, I have a notion of observations I see today and observations I see tomorrow, and then I can reason about how to do this. Of course, you could also just hack things around, but that's not what we want to do here. What we do is consider the most basic statistical learning setting, and this setting is completely defined by one ingredient, which is exactly the relationship between input and output, which we assume to be a probability distribution. We assume that X and Y are two random variables and there is a joint probability distribution on X and Y. This reflects our uncertainty about a bunch of things. When we pick the x's, we don't really know exactly which ones we're picking. If you take a picture of, say, a character on a plate, there can be noise, there can be perspective distortion, there can be a bias towards certain characters over others. There are all kinds of uncertainty in the way you collect the data. And similarly, the moment you put labels on, there might be mistakes: because I'm lazy, because I'm sleepy, because the 2 and the Z really look similar, and if there is enough noise you won't be able to tell them apart. So you dump it all into the probability. And notice that everything is in there: it might be that you compressed the images to a low resolution, so they're bad and you cannot read them anymore. So there is no attempt to model every aspect, which could be interesting; it's more an attempt to dump everything into something that somewhat reflects our ignorance, and we like to call it uncertainty, or whatever you want. That's the first and only, if you want, rule of the game. This is really the only ingredient; that's the basic axiom that we put in.
There is a probability distribution behind our data generation. Then there are a couple of twists, which are what the statistical learning theory setting prescribes. One is that you're not interested in finding the probability distribution itself. You could; it's a legitimate question, but that's not what you want to do. What you want to do is make predictions. That's what we said before: given an image, you want to say what the label is. And to do that, you might not need to find the whole probability distribution; you just have to find a good function. So you have to define what a good function is. A good function is one that allows me to predict y well given x, one for which the error I make is small. So you introduce a notion of error. After all, you're making a deterministic prediction within a probabilistic environment, so you're going to make mistakes. You introduce a measure of error, and then you say that what you would really like to make small is the average error over the entire distribution. This is deep, because first of all it's an insane request: you don't have access to this distribution, so it's purely theoretical. Ideally I would like to do this, because it means I would make the error on past and future data small. At this level of generality, we can only agree that this is a pretty reasonable goal: if I could look at the future, I would like my predictions to be good for everything. At that point you can define the problem in two lines. The mathematical problem is to minimize this error; it's just a minimization problem, a variational problem, but the twist is that you don't have access to this P. All you have access to is a bunch of observations from this P.
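In symbols, with the square loss the lecture sticks to, the object just described can be written as follows (a standard formulation, spelled out here for reference):

```latex
% Expected risk of a function f under the joint distribution P on (x, y),
% using the square loss:
R(f) \;=\; \mathbb{E}_{(x,y)\sim P}\big[(f(x) - y)^2\big]
      \;=\; \int (f(x) - y)^2 \, dP(x,y).

% The (purely theoretical) learning problem is
\min_{f} \; R(f),
% where the minimum runs over all functions, even though P is unknown
% and all we observe is a sample (x_1, y_1), \dots, (x_n, y_n) \sim P.
```

The twist mentioned above is exactly that P appears in the objective but is never available; only the sample is.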
So you define the problem you would like to solve, and then almost immediately you have to give up solving it exactly; you have to settle for an approximate solution, because you don't have access to this P. Now, the question was: why is the expectation here over both X and Y, rather than fixing an X and taking the expectation over Y? Well, if you fix just one X, you only know what you're going to do on that specific X; it's a local measure of error. Suppose, for example, that the distribution is finite and you have just two x's: one of them happens nine times out of ten, and the other one time out of ten. Where do you want to make errors? What I'm trying to say is that some images, some texts, might be more likely than others, and you want to pay more for errors on those. If you want to put a face recognition system on that door, and the people are me, Antonio, and Chris, you can ignore me and it won't cost you much, right? Because there is going to be a ton of weight on images of Antonio and your mates, and I don't show up much. From a statistical point of view, you don't want to put a lot of weight on my face. So the marginal tells you how many times you see a certain X, and the expectation says: if you see it many times, you should pay more; if you don't see it many times, you should pay less. Does that make sense? If you fix just one X, you don't know how often it occurs. You might decide you want to pay as much for me as for him; that's a choice you can make, but you have to be aware of it. All right, that's it. That's the problem. Now, in this problem we're actually making a pretty strong assumption. I imagine during the week you might talk about things like dynamical systems or reinforcement learning.
This is where we stand a step away from those, because we're assuming that, first, the data always come from the same mechanism, one that doesn't change over time. That's the first big assumption, and if you put a camera on top of that door, it might not be such an insane assumption. If you want to predict whether it's going to rain tomorrow, it's a totally insane assumption. As soon as your data really has a dynamical structure, doing this is insane. But there are a lot of cases where, even when it's insane, it's good enough to make some kind of qualitative statement, and a lot of other cases where it works just fine. The second, hidden assumption is that the process of making predictions will not influence the data collection. So not only do the data not change over time, but whatever I'm doing is not going to influence what I see next. This is the big step away from reinforcement learning, where there is constant interaction between the data-generating process and the learner, the two constantly affecting each other. It also differs from things like active learning, where you're given the right to ask for specific points: I saw this, I make these mistakes, give me that point. If you want to learn, say, a step function, after a while you understand where the step is, and you say: give me points there, don't give me points far away, because near the step is the only place where something happens. The reason we make these assumptions is mostly simplicity, and mostly because this is the one place in machine learning where we do have a very clear idea of a lot of things; most of the rest is at more of an infancy stage. All right, so far we've just defined the problem. Now we want to build something inside this framework. Any questions about this? How many of you are familiar with the statistical learning setting?
All right. For those who are familiar: you'll see I'm also making a few simplifications along the way. For example, I stick to the square loss. You can consider other loss functions, say the logistic loss or the absolute value; I don't care. What I talk about won't really address the problem of choosing a loss function, which, long story short, is a largely independent issue. So I'm just going to stick with the square loss, because that's not where the meat of the discussion is going to be. Yes, in a minute. So the other point Antonio is raising is this: the problem is completely unapproachable for two reasons. Reason one, you don't know the expectation, so if you just try to take the derivative and set it equal to zero, it's going to be theory but not practice. Reason two, even if you could do that, the minimization here is defined over all possible functions, which is something you cannot manage; it's kind of hard to represent any possible function in a computer. Let me also say, because it's something I learned recently and hadn't realized, that this setting also models another situation. In the classical machine learning setting, you assume you don't know this distribution, and then you have to approximate because all you have is data. But especially in physics, and more generally whenever you do simulations, you might actually be in the case where you know it. You can generate the data according to some standard model that you believe is correct. You just don't want to generate a gazillion points every time you make a prediction; you want something faster. So you first generate a bunch of data and then fit some reasonable function once and for all, just to make prediction quick. The problem there is not that you lack data; you're completely flooded with data. The problem is that you want to make predictions in a reasonable time.
In high-energy physics, for example, they can generate enormous amounts of data, and the problem is mostly to process them and make predictions fast. Either way, this changes the game a little, but not that much, because at the end of the day it's just a matter of how many points you have; you still have the problem of getting, from finite data, a good approximation to the original problem. Now, before we get to building algorithms, let me make a couple of remarks about what an algorithm is and how you measure its quality. An algorithm is a map that, given the data, returns a function which is an approximate solution of this problem. This is very vague; in general, anything is like this. I'm typically going to denote with a hat the quantities that depend on the data. I'm not going to write out the dependence on the full data set, not even on the number of points, but I use the hat as a reminder to be careful, because things do depend on the data. And then, how do you measure the quality of an f? Typically in mathematics you try to find functions that are close to the minimizer. But in machine learning, the thing that is meaningful is the error itself, the objective function, because it is interpreted as the error you will make in the future. You don't care whether two functions are the same or different; all you care about is whether they have the same prediction performance in the future, and that's what this quantity measures. So you want to find a function today, based on the data you have, that ideally has an error close to the best possible error. Can you compute this quantity?
Just a sanity check that I haven't been talking to nobody: no, because the only object we introduced, this error, is an expectation over everything, and you don't have access to it. Still, you can do theory about it. This is what you want: after all, I defined the problem through this minimization, and I want to know whether I solved it or not, whether what I found is a good solution of this problem. This is the obvious way: I have an approximate minimizer, and I check how far its error is from the true minimum. In practice, if you want to quantify this, you have to notice that this is actually a random quantity, because it depends on the data and the data are not fixed. So you have a couple of different ways of proceeding. You can either show that, as the number of points goes to infinity, you get the right solution, or, even better, say how fast that is going to happen. This is why it's called statistical learning theory: you attach to the algorithm certificates that have a statistical, probabilistic nature. Convergence here is typically called consistency, and this is one of the many forms of bound, where you look at a probability going to zero. You can rewrite this in fifteen different ways and talk about sample complexity, learning rates, and so on and so forth. We're not going to talk about these too much. What we want to talk about is: how the hell do you build a learning algorithm? And what we're going to discuss today is the one natural way of doing this, based on so-called empirical risk minimization, which is kind of the obvious thing. What is it that I don't like in the original problem? That I don't know the expectation, and that I cannot work with the space of all possible functions. So in what we discuss today, we replace the expectation with something we can compute, the empirical average, and we restrict the space of possible functions.
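To make that replacement concrete, here is a minimal sketch (the function name and toy data are mine, not from the lecture) of the empirical risk for a linear model under the square loss, substituting the unknown expectation with an average over the n observed pairs:

```python
import numpy as np

def empirical_risk(w, X, y):
    """Empirical average of the square loss for the linear model f(x) = <w, x>.

    X : (n, d) data matrix, one input per row
    y : (n,)   vector of outputs/labels
    w : (d,)   parameter vector
    """
    residuals = X @ w - y
    return np.mean(residuals ** 2)

# Toy check: with the "true" w and noiseless labels, the empirical risk is zero,
# while any other w pays a positive average error on this sample.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
print(empirical_risk(w_true, X, y))        # 0.0
print(empirical_risk(np.zeros(3), X, y) > 0)  # True
```

The empirical risk is computable from data alone, which is exactly what the expected risk is not.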
In my case, I'm going to replace the space of functions with linear functions. That's it: I go from not accessible to high school, which is refreshing. Now, you can complain right away that linear functions are too simple. But before you complain, there are at least a couple of reasons why you shouldn't complain too much. The first: the intuition that linear functions are too simple comes from this plot. If I give you five points in generic position, you cannot fit them with a line. That's the intuition you get from drawing things in one dimension. But you can also write down the set of relationships you get when you try to fit the data with a line: the line should fit the first pair, the second pair, the third pair, and so on. You can collectively write this down as a linear system in n equations and d unknowns. Then it's easy to convince yourself that, as soon as d is bigger than n, or even just as big as n, you can fit the points perfectly. So a line looks poor, but in high dimension it can still fit perfectly. So if you overparameterize, if you have a lot of parameters, a lot of variables, you already have to be careful, because even the linear model can overfit. I'm not saying anything smart here; I'm just pairing the naive intuition with something equally basic: be careful, because one reason lines are not so stupid is that in high dimension they can fit anything you want. And we're going to have a certain freedom in what we mean by X: you can build features, you can increase the dimension of your representation. That's reason one. Still, I'm not in any way trying to say that linear functions are all you need.
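A quick numerical sketch of that claim (my own toy example, not from the lecture): with n generic points in d dimensions and d bigger than n, the minimum-norm least-squares solution fits the data exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 20                       # fewer equations than unknowns: overparameterized
X = rng.standard_normal((n, d))    # generic data matrix, full row rank almost surely
y = rng.standard_normal(n)         # arbitrary labels

# For an underdetermined system X w = y, lstsq returns the minimum-norm solution.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# A "line" in 20 dimensions fits 5 generic points exactly (residual ~ machine precision).
print(np.max(np.abs(X @ w - y)))
```

Rerunning with n = 5, d = 1 instead reproduces the one-dimensional picture, where a generic set of points cannot be interpolated.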
In fact, in a lot of problems you need much more, you need nonlinear functions, but it turns out that most nonlinear functions we know how to handle are built on top of this. We're even going to put some nonlinearity before the line, or inside the line, and then see what we can do with that. So by understanding what happens here, you can go a long way in understanding what happens even when you consider more complicated functions. And in fact, for at least a big chunk of them, the extension is essentially trivial, and we're going to discuss it at the end of this class. So the reason for restricting to linear functions is mostly pedagogical, and mostly because it makes the notation much less heavy. Yes? So what he's saying is: for now, you give me the data and I fit, and I try to see what's going on. Now, notice that, yes, you're fitting, but the discussion becomes interesting because I'm fitting without taking all possible functions; I'm taking a relatively small function space. So while it's true that when d is much bigger than n I can do whatever I want, when d is much smaller than n I'm back in the earlier situation: I'm fitting, kind of, but I chose a model space so small that I won't be able to fit that much. So if you see the dimension as a free parameter you can move, you're already thinking about when you can fit and when you cannot. But we're going to discuss this for an hour, so it should not be a problem. For now, I'm not putting constraints; we'll get to that in a minute. That's what we want to do. I want to think about this a little bit, just warming up with linear algebra for a little while.
Okay, so literally for the next 20 minutes or so, we're going to do a reminder of basic linear algebra, which you could easily extend to much fancier functional analysis if you wish, but this is enough for us. We want to think about the linear system and go a bit slowly over this kind of reasoning, because then we're going to pimp it up for three days. One observation: it's useful to go to the vectorial notation I was using before. I call the data matrix X hat and the output vector Y hat. The matrix is just the collection where each row is one of the inputs: an image, another image, another image. So d is the number of pixels, say (let's take images to keep it simple: images and pixels), and Y says: is it a Z or not? Then you can rewrite the empirical risk minimization in matrix form, trivially. And this means that basically we're looking at the least-squares problem associated with a linear system. There is a linear system underlying everything, which is going to give us some mileage, because now we can ask: what do we know about linear systems? Hopefully all of us know a little bit. And then you can ask how what we know overlaps with what I introduced before: the story that the data are random, that they come from a distribution, and that the goal is not just to solve this linear system but to use it to peek into future data. That's a new twist. Usually you're given a linear system and you try to solve it; here we want to solve it for a specific reason, which changes the perspective a little. All right, but let's do it anyway. This is a bit of a refresher. Let's first consider the situation where you have more data than unknowns.
Then what happens is that the output vector has a higher dimension than the number of unknowns. If d is the number of free parameters you have to fit, the output vector is longer, and generally you cannot explain it exactly with this d-dimensional subspace. So all you can do is find the closest point: you find the w that, when mapped through the matrix, gets you to the point closest to the output. That's the geometric meaning of this norm. You do least squares because, when n is bigger than d, an exact solution typically doesn't exist, and you want to find the best possible projection. Now, if you just take the derivative and set it equal to zero, you find this equation. There is nothing interesting going on: you literally take the derivative of the square, set it to zero, and you get a linear system, which is not surprising because you're differentiating a quadratic function, so you get linear constraints. All right, fair enough. I'm assuming everybody is okay with this kind of stuff; if you're not, stop me. Now you can consider the somewhat overparameterized regime, where n is smaller than d. Now you no longer have the problem that this guy can fall outside. Throughout, I'm assuming genericity: whichever of the rows and columns is the smaller set, they're linearly independent. You could make life a bit more complicated; when n is smaller than d, even the n rows could be linearly dependent. I'm not going to do this, just for simplicity, but it's easy to see at which point it would make a difference. So keep it in mind, and we'll get back to it if it's bothering you. Anyway, in this case, if you assume genericity, no rank degeneracy, the output is going to fall within the range of the matrix: you have more unknowns.
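For the n > d case, setting the derivative to zero gives the normal equations, X^T X w = X^T y. A small sketch (toy data of my own) checks this against a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3                      # more equations than unknowns
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Setting the gradient of ||X w - y||^2 to zero yields X^T X w = X^T y,
# a d x d linear system (here 3 x 3).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Sanity check against the library least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```

Geometrically, X @ w_normal is the projection of y onto the range of X, the closest point the model space can reach.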
So the problem is not that you cannot explain the data; the problem is now that you might have multiple solutions. You can clearly fit them, okay? But now you have multiple solutions, okay? Again, this is a reminder. If X w = y has a solution w, and there are vectors that give zero once you apply X, so vectors in the null space, then of course you can generate infinitely many solutions: you take any of these vectors, add it to what you found, and you find another solution, okay? So you have a problem of non-uniqueness. Now how do you solve this, okay? How do you pick one among all the possible solutions? The classical way is to say: among all the possible solutions, I find the one with minimal norm, okay? Sometimes you can view this as an energy, okay? You can just view it as a constraint. This is the classical way. And this is the 1955 kind of way, okay? In the last 20 years, you can ask, oh, why don't I put in the one norm, the three norm, the four norm, the six norm, the entropy, this and that. But we want to keep it easy, okay? We want to do something easy, and this choice is classical, so we know a lot of stuff about it. Now, if you make this choice, you take non-uniqueness away, and you get uniqueness back. And now you can ask: what about the form of the solution, okay? And you can do a quick computation with Lagrange multipliers to show that the solution of this looks very similar to the one you had before. In some sense, you swap the order, okay? Before, you had X transpose X, then X transpose: you invert X transpose X, which notice is d by d, okay? And now, it turns out that the right thing to do is to invert X X transpose instead, which, notice, makes sense, right?
Because you have a rectangular matrix that is either like this or like this, and basically you're saying: depending on which of the two is thinner, I'm gonna build the smallest square matrix I can, okay? That presumably is invertible, and then I'm gonna invert that. That's kind of the natural thing to do, okay? So this problem now is n by n, okay? So you take the smallest between n and d, and you invert that. If you have degeneracy, you will still have to do a little bit more, okay? So far so good. So this is elementary linear algebra for grown-ups; you might not know yet why we're doing this, but it's gonna be useful in a minute, I promise. So this is the summary. The main character here is this pseudo-inverse, okay? Here, the dagger does not stand for adjoint or anything; it is either one of these two matrices, okay? It's the so-called Moore-Penrose pseudo-inverse. It's not quite the inverse, because this rectangular matrix might not be invertible, but it is a notion of inverse which is good enough for our purposes: a notion of inverse in the least squares sense, or in the minimum norm sense, okay? Okay, so there is one last way to look at this, and then we're gonna ask the question of whether it's a good idea or not, okay? Going back to the question of what kind of restriction we made. To do that, it's useful to consider the singular value decomposition of the matrix X. Everybody knows what the singular value decomposition is? Is there somebody that doesn't know it? All right, so if you don't know it, which is good, because then I can tell you now. Think of it like this: you know what eigenvalues and eigenvectors are, okay? These are for nice matrices that are typically square, and the ones we like are symmetric and positive definite. If you have a rectangular matrix, you have a similar decomposition, okay?
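A sketch of the underdetermined case, again with made-up dimensions: the minimum-norm solution fits the data exactly, and adding any null-space direction gives another solution that fits just as well but has a larger norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 50                      # fewer equations than unknowns
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Minimum-norm solution: invert the small n-by-n matrix X X^T
w = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(X @ w, y)      # it fits the data exactly

# Non-uniqueness: add a null-space direction and you still fit,
# but the norm can only grow
_, _, Vt = np.linalg.svd(X)
z = Vt[-1]                        # a right singular vector with zero singular value
assert np.allclose(X @ (w + z), y)
assert np.linalg.norm(w) < np.linalg.norm(w + z)
```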
So if you have a matrix A, you take vectors, and you write it like this; so these are vectors you collect, these are the eigenvectors, and these are the eigenvalues, okay? So I put together all the relationships. This is one eigenvector, another one, okay? Here the relationship is similar, but because the matrix is rectangular, you're gonna have two sets of vectors. You have the singular vectors from one side and the other. So you put V here, but the matrix is n by d, so it has to spit out something of dimension n, okay? So I rewrote this there; you can stare at it for a minute. This is the classical eigendecomposition, and this is the singular value decomposition: you put in something that looks like an eigenvector, but you don't get the same vector back. You have to change the dimension, otherwise it doesn't make sense, and that's the other set of singular vectors, okay? So you can think of the singular value decomposition as generalizing the eigendecomposition to rectangular matrices, and it spits out not eigenvalues and eigenvectors, but singular values and singular vectors, which are two sets of vectors. Usually these give you an orthonormal basis for each space, okay? Now you have two bases, one for R^d and one for R^n. One for the set of rows and one for the set of columns. So that's what it is, okay? It's a generalization of the eigendecomposition, if you wish. Why is this useful? Because now you can write down the action of any matrix as essentially multiplication by some numbers. You change basis, and a matrix just becomes multiplying something by some numbers. What? The singular vectors, exactly.
In particular, for the action of the matrix itself: you take a vector w, write it on the basis of the V's, find the coefficients, multiply by the singular values, and then multiply everything by the corresponding singular vectors of R^n, okay? Again, if you know this, I'm using too many words; there's nothing going on here. I'm just writing in the obvious way the action of a matrix in the singular system, okay? Why do we like this? Because now we can look at the inverse with clear eyes, okay? For the pseudo-inverse, first of all, notice that here I start to put r, okay? Fixing what I've been sweeping under the rug up to now. What is r? Well, it's the rank of the matrix, okay? In the simplest case, this would just be the smaller between n and d, but now you can think of it as being even smaller than both, okay? It doesn't matter; this still holds, okay? And now you can write down the action of the pseudo-inverse just this way, okay? It's the same equation as before, but the nice thing is that you can use what is called spectral calculus to give an explicit expression of the matrix just in terms of the inverses of the singular values. So rather than manipulating matrices, you can now manipulate numbers, which we know how to do a little better, okay? So this is the form of the pseudo-inverse. Again, nothing deep going on here; I'm just going briefly over things, okay? So this is linear algebra. Up to now it's only linear algebra. Let me summarize. You start from the statistical learning problem, but then you boil down to least squares. Then you say, okay, but this is stuff I studied in another book. So let's open the other book. You open the other book and you say, okay, I have unknowns and equations; one can be bigger than the other, overdetermined and underdetermined. You do a couple of derivatives, you think about it a bit, and it turns out that the central object is the pseudo-inverse, okay?
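The spectral-calculus expression for the pseudo-inverse can be checked in a few lines (a random matrix and made-up names, just for illustration): invert the nonzero singular values, keep the two bases, and you recover exactly what `numpy.linalg.pinv` computes.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 8))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(s > 1e-12))        # numerical rank (generically 5 here)

# Spectral calculus: keep the bases, replace each s_i by 1/s_i
X_dagger = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
assert np.allclose(X_dagger, np.linalg.pinv(X))
```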
This thing that you can write down in two different ways, and that you can think of as the way to invert a matrix which is not invertible, okay? What do you do? Well, this is one way to think about it: you diagonalize it, you pick all the eigenvalues that are bigger than zero and invert them, and the other ones you just kill, okay? Notice that here I'm only going up to the rank, okay? Any questions about this stuff? So this is our necessary mathematical background, because what we wanna ask is: what are we doing from a statistical learning point of view, okay? With respect to fitting the data, making predictions and so on, are we done? Well, first of all, we can make one observation, okay? This is writing everything on a basis, but it's not just any basis. It's the basis of eigenvectors of this; sorry, singular vectors of this. So it's a specific basis. We're throwing away some dimensions, and we're somewhat putting more weight or less weight on certain dimensions via these inverse singular values. And what are these eigenvectors? Well, they have a name: they're the principal components of your data, as in PCA. If you know statistics, you can view them as the directions where your input data has maximum variance. But if you know math, you also know that if you give me a set of vectors, the best way to approximate those vectors in the L2 sense is exactly to take the first however-many eigenvectors of the matrix of these vectors, right? So you have a matrix, you want to find the best rank-r, or rank-k, approximation, and you diagonalize the matrix and truncate. That's it. So these are the directions that retain most of the information in the input data. And in some sense, this algorithm seems to like those directions, okay? It likes those directions. Completely unsupervised, if you want: they're directions computed purely from the inputs, okay? And this algorithm seems to like those because it puts weight on those, okay?
And this comes from the fact that we put here the two norm, okay? If you don't put the two norm here, if you put something else, you change the game. But if you put the two norm there, everything somewhat comes together in this specific way, and it turns out that this particular set of vectors is meaningful for this algorithm, okay? It likes those directions. In the space of possible solutions, not all solutions are the same, not all directions are the same. The directions corresponding to big singular values are more interesting than the other ones, okay? That's what this is saying. So from a statistical point of view, you start to put in what you might call bias, in the sense of inductive bias, not in the sense of getting an offset in expectation. So you somewhat put in preferred directions. You break the symmetry of the space and you start to say: I like going in that direction better than going in this direction. So implicitly, you're making the assumption that that's a good idea, okay? It's hidden in this algorithm. If that's a bad idea, you're already lost. If your solution was sparse in the canonical basis, good luck with doing this, right? Because you're just going somewhere else. So this is the first place where a bunch of innocent decisions led us to realize that we actually make fairly big assumptions in the model. Not only that it's linear, but that among all linear functions, we like some solutions better, okay? That's the first realization: we are actually imposing some kind of belief implicitly, just by doing this simple math. That's observation one. Then you can ask yourself: is this enough, or is there something that could go wrong? And again, if you think of what's written in the book that you just opened, the one about linear systems, typically there is a complaint at this point. What might go wrong with this expression here? The matrix is not diagonal.
So you say the matrix is not diagonalizable, but you can take the singular value decomposition, so that's fine, okay? You can do that. The eigenvalues that are zero, you killed; but you're on the right track. So he's saying, well, the matrix might not be diagonalizable; no, that's fine. What if the eigenvalues are zero? They're not, because we put the rank constraint, so we only go up to the smallest one which is not zero. But still, the problem is that it's not zero, but it can be extremely small. And in that book, it's written that that's typically not a good idea. Because if this guy is very small, in a perfect world where nothing moves, there's no problem. But if anything starts to move in this expression, this thing can explode and become completely different. For example, if I take here a y, and now I take another y, I can get a w which is completely different. Okay, that's a problem in the numerical linear algebra book, but is it a problem for us? Well, let's see. You give me a set of images, you make a prediction, and then I take an image, I tweak three pixels, and you say something completely different. That sounds kind of annoying, right? There is no prediction yet, but it's there, okay? When will this happen? Well, it's most likely when I have many of these numbers and they have a big spread, okay? But clearly it can happen, okay? It can happen that if these eigenvalues are small, and I make some change, then things blow up. And notice that if I change y, which is the classical thing you do in linear systems, this can go wrong. But even if I change U or V, or even S itself, this can go wrong, okay? If I have a small change, the whole thing explodes. Absolutely, yes. When you say small, what the hell does it mean, okay? Typically what you do is that you take the biggest eigenvalue and the smallest eigenvalue, and you look at the ratio, because then it's a relative statement, and this ratio is called the condition number.
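A toy illustration of this instability, assuming a made-up two-dimensional example with one nearly degenerate direction: with exact data the pseudo-inverse is fine, but a tiny kick to y gets divided by the small singular value.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
# Two directions, the second with tiny variance: an ill-conditioned X
X = rng.standard_normal((n, 2)) * np.array([1.0, 1e-4])
w_true = np.array([1.0, 1.0])
y = X @ w_true

s = np.linalg.svd(X, compute_uv=False)
cond = s[0] / s[-1]               # condition number, here around 1e4
assert cond > 100

# With exact data the pseudo-inverse recovers w_true...
w = np.linalg.pinv(X) @ y
assert np.allclose(w, w_true)

# ...but a tiny perturbation of y can move the solution a lot,
# because the noise gets divided by the small singular value
w_noisy = np.linalg.pinv(X) @ (y + 1e-3 * rng.standard_normal(n))
print(np.linalg.norm(w - w_noisy))
```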
So if you like the term, you can say that if the condition number is bad, this can be an extremely bad idea. If you try to do it numerically, no matter what you do, the solver is gonna complain. But also because if you change the data slightly, you're gonna get a completely different solution. And that's a problem for learning, not only a numerical problem: instability from a numerical point of view reveals itself as a statistical problem, instability in the sense of fitting the data too much and then getting poor predictions in the future. Now, we're not gonna pursue this direction; we can stop the intuition here, but I just wanna attach a couple of equations to what I just said. In practice, this is what you do: you have least squares, you set the derivative equal to zero, you get this linear system, okay? In theory, the moment you replace all possible functions with linear functions, you can at least do the computation with the expectation. And it's very easy; I just give you the solution, there's nothing going on. You just take the derivative, and what you get is the expectation of x x transpose on one side, and the expectation of x times y on the other. Okay, so you get a linear system with quantities that you cannot compute. So if you do this, you can really draw a parallel, which is a linear algebra kind of parallel: in practice, you have to solve this linear system; in theory, you would like to solve that linear system. So it's literally like the case of a linear system where both the data term and the matrix itself are subject to perturbation, which is not the usual perturbation: it's kind of a random discretization, okay? Plus noise. But at a high level, this gives you the picture of what's going on. This can actually be developed and made precise, the connection between inverse problems, linear systems, and statistical learning, which has more of a numerical flavor, okay?
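The practice-versus-theory parallel can be illustrated with a toy experiment (the population covariance here is an arbitrary choice of mine): the empirical matrix you can compute is a random perturbation of the expectation you would like to use, and the perturbation shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 3
cov_true = np.diag([1.0, 0.5, 0.1])   # assumed population covariance E[x x^T]
errs = []
for n in (50, 50000):
    X = rng.multivariate_normal(np.zeros(d), cov_true, size=n)
    cov_emp = X.T @ X / n             # empirical version of E[x x^T]
    errs.append(np.linalg.norm(cov_emp - cov_true))

# The empirical matrix is a random perturbation that shrinks with n
assert errs[1] < errs[0]
```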
Again, you solve this linear system because that's all you can do, but you would like to solve that one. So you replace this vector by this, because you can compute it, and you replace this matrix by this, okay? And so the question we asked a second ago is not just abstract. Is this condition number bad? It can be very bad. It can be that I'm really making an approximation: I would really like to solve this problem, but I'm solving that one. So if I have a bad condition number here, this can really blow up everything, okay? And of course, if you like math, you can go ahead and try to quantify this. You can do random matrix theory; you show how close this is to the expectation of this, and this to the expectation of that. You put everything together, and you can quantify everything I said and get quantitative statements rather than qualitative ones. But that's not what we're gonna do. What we're gonna do is: you believe me that things can go really wrong, and you try to fix them, and we see the first way to fix them. I need to drink. Do you have questions? All right, so just in terms of terminology: one of the most abused words ever is regularization, because it's used all over the place with slightly different meanings. Here I want to just clarify a couple of meanings. If you look, for example, at the literature on signal processing, okay, what we just did is already called regularization. It's called regularization because the moment in which, among a set of possible solutions, you go and pick the one with minimal two norm, this selection principle is called regularization, okay? If you want to call converging to the minimal norm solution regularization, that's fine; I mean, if we agree that that's what we're gonna call regularization, it's fine, okay? But if you look at the classical literature on regularization and inverse problems, that's not called regularization. It's called a pseudo solution.
It's just an approximate solution; it's another name, okay? But the name regularization is reserved for something else. So typically we don't call this regularization. Well, typically people in inverse problems don't call this regularization; they call it a pseudo solution. Signal processing people call it regularization. But the point is that regularization is also used for something else. It's used for something that is close to this, but has a better condition number. It's more stable. It will suffer less from this problem of blowing up if I move things a little bit. So in classical regularization theory, the name is attached to that notion, okay? To the notion of something which is stable. Not so much that the solution is unique and exists, but that it is stable in some precise sense. And what we wanna see today is how to achieve that in the classical way. So how do you do it classically? Well, one way is to go back to the definition of the pseudo-inverse. I showed you the definition in terms of singular values and singular vectors. There is another definition, which is basically this: take X transpose X, add lambda times the identity, invert, and then multiply by X transpose. Or do the other thing, where X transpose moves in front. The pseudo-inverse corresponds to the case where you let lambda go to zero, okay? That's a definition of the pseudo-inverse. And a possible idea to fix our problem is to say: well, what if I don't let lambda go to zero? What if I take lambda strictly bigger than zero? If I take lambda equal to zero, I'm back to the game before. What if I take lambda strictly bigger than zero? Let's see what happens, okay? The spoiler is that the smallest eigenvalue, or singular value, of this is now gonna be driven by lambda, and now I can control the condition number, okay? And by controlling the condition number, I can play with the stability that was causing problems. Now, this basic idea... yes? How would you do that? Before I answer your question: you kill them? You could, okay?
So let me see what's going on afterwards. Give me one second and we'll comment on what you're saying in the next slide, okay? So the question is: this is putting in some kind of bias, and it's basically fixing the problem of small eigenvalues; could you somewhat cure the small eigenvalues some other way? The answer is yes, and in the next slide it's very easy to see, but you will not make the problem of introducing a bias go away; you just put the bias in a different way, okay? Okay, so anyway, this is the question. Now, this idea actually comes under different names, okay? Classically, it's an idea that seemingly appeared in the 60s, okay? People that solved linear systems and integral equations called this Tikhonov regularization, or Tikhonov-Phillips regularization, but around the same time precisely the same thing was proposed in statistics, and there it's called ridge regression, okay? And they are roughly the same idea; that's basically the underlying intuition, okay? You can derive this in a bunch of ways, and it's not unrelated to some deep reasoning in statistics related to what is called the Stein effect, which basically shows that maximum likelihood might not be the best possible estimator, okay? And if you introduce a bias, you can get something better. We're not gonna discuss it, but these are just pointers. You'll find these ideas in three relatively distant places: integral equations, linear systems, and statistics. The Stein effect is actually a cute thing to look up if you have never done it. So we want to understand a bit better how this relates to what we said before. We can do it in two ways. One is to do the same trick we did before: we do the singular value decomposition and see what happens now, okay? And what happens is that instead of the one over s term, now we get this: this is gonna contribute a singular value, and this is gonna contribute the square of the singular value, plus lambda.
So this is the expression you have, okay? Instead of one over s, you now have s divided by s squared plus lambda, okay? Yes? This is bias, right? It's what? Bias, bias, I mean in terms of the expectation of the solution. Bias, absolutely, yes. So lambda introduces a bias. That's what I'm doing without saying it. If you know what the bias-variance trade-off is, spoiler: that's what we're doing. I'm not assuming that you know it; I wanna show you why you would want to do it, okay? So one way to introduce this, which is not what I will talk about, is to say: I introduce a bias and then I trade it off with the variance. I'm just telling you that now; but let's keep going with the numerical linear algebra point of view and just view it as augmenting the stability of your system. We're gonna get there, okay? But the main point is that by adding lambda, if you know what the bias is, that's what you're doing, okay? You're reducing the variance and augmenting the bias. And here I'm just showing you another way of looking at this, which is purely geometric, or analytic, whatever you wanna call it. It's not statistical; it's just: I was dividing by the small eigenvalue, and now I'm dividing by something else, okay? The ratio between the singular value and its square plus lambda, okay? And this gives you a different way to think about this, which is less statistical and more engineering, okay? It's more like filtering. You have something and you're filtering it. What do I mean? Well, if you allow me to think in terms of frequencies, okay, and Fourier analysis, you have low harmonics and high harmonics, okay? Now think of big eigenvalues as low frequencies and of small eigenvalues as high frequencies.
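The filter-factor form of ridge regression, s over s squared plus lambda, is easy to verify against the linear-system form; a sketch with toy sizes and names of my choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 20, 8, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge replaces the factor 1/s of the pseudo-inverse by s/(s^2 + lam)
w_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
assert np.allclose(w_svd, w_ridge)
```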
Then typically, you assume that the high frequencies are the ones that lead to trouble and give you instability, and you hope, you cross your fingers, that most of the energy is in the low frequencies. Well, if you make this analogy, that's what we're doing right here. We are basically saying: when you start to fiddle with small eigenvalues, you'd better chop them off. And this you can view as a filter function. When lambda is very big, it's a narrow low-pass filter: you only let the low frequencies through. Again, low frequencies here are big singular values, okay? If lambda is very big, you throw most of it away; you just have a very narrow low-pass filter. If lambda is very small, you allow yourself to go all the way down to whatever you have, okay? So you can view this as filtering, okay? And if you're familiar with what is called Wiener filtering in signal processing, you would not find this unfamiliar. It's basically the same idea, okay? Well, from this perspective, yes, the truth is that things are a bit more complicated than that. Because you still have the problem of noise, and you can see that this somewhat reduces the noise, okay? You could do the inverse, but you would still pay the smallest eigenvalue. And it might be that if you choose lambda correctly, and the right choice will typically depend on the signal-to-noise ratio, you can do what is called denoising, okay? So even if the matrix X were the identity: if you give me a vector corrupted with noise, and I know something about the vector, I may be able to get a better estimate that removes the noise a little bit. It turns out that you can do this with lambda, okay? But I'm thinking of the overdetermined system, to just talk about one case instead of two. But you're talking about two different things. So this is my matrix X, and each row is an image, okay? That's x one, that's x two, okay?
You're talking about frequencies in here, within each vector, okay? An image is a vector in there, and I'm talking about frequencies across vectors. So it's more like the statistics of the set of images, okay? If all images happen to have high frequencies inside, this means that those high-frequency patterns will be the top eigenvectors of the matrix, okay? So this is a bit more of a specialist question, but basically: if you're thinking about images, don't confuse the frequencies of an image with the frequencies of a set of images together, okay? It might be that if I look at a set of images, the worst possible frequency in the Fourier domain is the most likely one, and this can happen here, okay? Those are two different frequencies, okay? These are the frequencies across the images, and you're talking about frequencies within an image, I guess. If you come back to the beginning of the talk, is it the same as saying that if the images are very far from what you would expect, you kill them or not? In a sense, yes. Yeah, in a sense it says: if in the image space I have stuff that looks like this, okay, I'm assuming that these are the good guys, and then this, and then whatever is left. So in some sense, the assumption you like is that the directions along which your inputs have a lot of variability, the directions that reconstruct your inputs well, are also the ones important to make predictions. And this is the reason why I do this, okay? Because all of a sudden you see that you did some steps, and you're making some pretty heavy assumptions about when the algorithm is good or not, even without any statistics, okay? You kind of see that this algorithm is gonna be good only in certain situations, which are these ones, okay? The ones where the directions that give good reconstruction of the data are the important ones, okay? Okay, this is one possible way, okay? Another possible way is to take the variational formulation.
So we saw that the Moore-Penrose inverse was the solution of a least squares problem. Is this the solution of something, okay? And you can somewhat reverse engineer the gradient and show that this is the solution of this. Again, I write here the gradient of this expression, and it's easy to see. Okay, so now you have this point of view: you can stick to this formulation and take a more linear system point of view, but you can also go back here. And how do you see that? Well, here is yet another point of view. Now you're not just minimizing the error, but minimizing the error plus a constraint on w, weighted by lambda. How can you interpret this? Well, there are 15 ways of interpreting this, but one way is as a budget on your dimensions, okay? Again, assume that d is much bigger than n; then in some sense you have too many dimensions, okay? Too much room to fool around and find a solution. But if I put a budget on the sum of the weights that I can put on all the dimensions, which is what I'm doing when I look at the norm, what I'm saying there is: oh, this sum cannot be too big. And if I make lambda very big, then in the limit I'm essentially ignoring fitting the data, in the hope that I can find a simple w, where simple means that this sum is small. So you can view this as kind of an implicit way to reduce dimensionality in yet another way, okay? Just by shrinking the values of the coefficients. And that's why this thing is also called shrinkage. Again, this is a very operational perspective on it. And here you just have two different ways to introduce a way to build learning algorithms. And somewhat, I sneaked in an extra parameter, right? It was parameter-free before, and now I have this extra lambda that I'm not discussing, but this lambda is gonna be magic.
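A small check of this variational view, with arbitrary toy sizes of my choosing: the ridge solution makes the gradient of the penalized objective vanish, and a bigger lambda (a tighter budget) shrinks the norm of w.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, lam = 15, 40, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The gradient of ||X w - y||^2 + lam ||w||^2 vanishes at the ridge solution
grad = 2 * X.T @ (X @ w - y) + 2 * lam * w
assert np.allclose(grad, 0)

# A bigger lambda means a smaller "budget": the weights shrink
w_big = np.linalg.solve(X.T @ X + 100.0 * np.eye(d), X.T @ y)
assert np.linalg.norm(w_big) < np.linalg.norm(w)
```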
It's gonna be the one that allows me to go from fitting, which might or might not be a good idea, to being extremely stable, which might or might not be a good idea. And here is a kind of summary, okay? So we introduced regularization from a, say, stability perspective. You can start from the pseudo-inverse, which gives you some solution. And remember that you have this variational formulation; you have the matrix formulation and the variational formulation. And now you can go to ridge regression and play the same game. And here, well, forget about this factor, I meant to remove it; the one over n doesn't make much difference. And again, you can call this regularization, in which case you put some bias on certain directions of your space, the ones corresponding to the big eigenvalues, or singular values. And then you can look at this, and in some sense you do that even more, because now you put in lambda and you can really enforce it even more; you can somewhat kill information. Yes, students first. Is this the classical notion of stability? Is this uniformly distributing the budget over the dimensions, rather than killing off... Not uniformly, but, again, it's essentially doing that. I don't know what you mean by uniformly distributed. Can you compare it with the one norm? So at this level, if you know about the L1 norm, essentially you can read it this way. To justify the fact that I add this constraint, I said this is gonna put a budget on the weights. So if you just have two, and the first one is much more important than the other, you're gonna try to depress the second one and keep the first one high, okay? That makes sense: I don't let them do whatever they want. The one that is more important is gonna get more weight; you're not gonna try to depress it. Typically if you do this, you're gonna depress the coefficients but not make them zero. If you use the L1 norm, the big difference is that you actually make them zero, okay?
You can exactly kill them and make them equal to zero. This can be interesting if you want to read out and interpret what you did. That's, at one level, the main difference. Absolutely. Yeah, yeah, absolutely. So what I'm trying to say is that you can also view this as a measure of sensitivity: it's related to somewhat taking the gradient of this with respect to the data, and checking that this doesn't change too much with respect to your data. You differentiate this not with respect to w, but with respect to, say, x and y. Absolutely, yes. Going back to the question you asked: okay, but can you do it another way? If the problem was a small eigenvalue, you can. That was a good question; we did this yesterday night, very late. So, see here, okay? You'll see in a minute that we go down this path tomorrow, in a kind of strange twist, but you can ask the following question. All right, you're talking about statistical learning, but statistics disappeared and all I'm talking about is linear algebra, which is a fair thing, okay? I view it as a plus, because it means that you can recycle what you know and just pump it up a little bit. And if you go that way, you can say: okay, but in that book I saw, they said there are other ways to solve a linear system in a stable way, okay? For example, if I tell you that the problem is that I don't wanna put here one over s because the last eigenvalue is small, and you wanna fix this, how would you do that? Again, the question is: you have this expression, and you say, oh, I don't like this because this can be very small. As a matter of fact, probably doing this is not the first thing that any human being would think about, presumably, right? What would you do? They're ordered, right? So the first one is the biggest, the second is the second biggest, and so on. So to fix this, what would you do? You ignore the last one, okay? Because the first ones are big. That's the correct thing, clearly, right?
So it's just the last one. You say, oh, this one over s is too big; let me stop at r minus one. Well, r minus one, maybe r minus two. You know what, let me put m here, and I take m smaller or equal than r, okay? And that's fine, okay? Now, what did we do, though? Well, first of all, we rediscovered what is called truncated singular value decomposition, which is probably the very first way of doing this. Second of all, as you'll see on Wednesday, this is also what you might call principal component regression, because it turns out that you can invent things twice if you publish in different communities. Remember this. And third, did we actually solve the bias problem? Not really, because see, here I have lambda and there I call it m, and they play the same game. Here, if I give you infinite data and perfect information, if you put lambda different from zero, you're killing stuff. Here, if I give you perfect information, you're still killing stuff with m. It's true that lambda also bothers the large eigenvalues. So in some sense, the question I showed you before basically says: if the eigenvalue is small, I'm gonna bother it. The truth is that this is not quite right: you still add something to the large ones too. This one doesn't. And in fact, they have slightly different properties. They're two low-pass filters; they have the same shape, but slightly different properties. You can quantify them once you go and do theory and try to make sense of all these condition numbers — stability here, random matrix theory there. Essentially, to really quantify things, you have to put this into the picture.
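To make the two filters concrete, here is a small numerical sketch (the sizes, lambda and m are my own illustrative choices, not from the slides): ridge regression rescales every singular direction by s/(s²+λ), while truncated SVD keeps 1/s on the top m directions and zeroes out the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # n = 50 points, d = 10 features
y = rng.standard_normal(50)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = 0.1   # ridge parameter (illustrative choice)
m = 5       # truncation level for TSVD (illustrative choice)

# Ridge filter: shrink every direction by s / (s^2 + lambda);
# it bothers the small singular values a lot, but the big ones a little too
w_ridge = Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

# TSVD filter: keep 1/s on the m largest singular values, kill the rest exactly
f_tsvd = np.where(np.arange(len(s)) < m, 1.0 / s, 0.0)
w_tsvd = Vt.T @ (f_tsvd * (U.T @ y))

# Sanity check: the ridge filter agrees with the matrix formulation
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
```

Both solutions are spectral filters applied to the same SVD; only the shape of the filter differs, which is exactly the point made above.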
You have to do stability analysis in a quantified way, where you say: okay, I'm gonna take what I do on the data, then I'm gonna take the same thing on infinite data, and I have to compare. Then I'm gonna have an approximation term when m grows or lambda goes to zero, and a stability term when you replace this by that. And then you can quantify whatever we're saying. But at a high level, these are just two different ways to do the same thing. And that's exactly what we're gonna discuss on Wednesday: we're gonna start exactly from this, to introduce principal component analysis as a generic projection method, and then we're gonna try to see how to do it in a smarter way. [Student: don't you bother all of them?] Yeah, you bother them all. So, in one case you say one over s, or zero. In the other case, you say s divided by s squared plus lambda. So even when s is very big — okay, I cheated, it's not quite zero — you still add something. Here you add absolutely nothing; here you add something. But it turns out that they're more similar than different, let me put it this way, okay? Okay, so most of this series of lectures is about doing this in different ways and trying to understand that there are some good reasons to do that, okay? So this is what is called the empirical risk minimization principle, and this is its incarnation in terms of penalized empirical risk minimization. It's called M-estimation in statistics, okay? And again, there are a lot of names for the same kind of ideas. This is not the only way to do things; it's nice because it's a nice principle, okay? We understand that. But it's not the only way to do things. One thing I'm not gonna do is go beyond this simple setting. What I'm gonna show you, mostly tomorrow, is how to replace this whole thing with something that behaves roughly the same, okay? The main difference is gonna be in the computations, okay?
But you can also ask: can I replace this with another loss function? Can I replace this with another norm, okay? We're not gonna discuss that too much in these lectures, not because it's not a good question, but because already in this setting there are some interesting questions, and I wanna stick to those, okay? The other obvious thing you could ask is: fine, but you're still looking at linear functions — can I go away from them? And we're gonna discuss this a little bit for the next 20 minutes, if I have 20 minutes, okay? So we want to discuss this a little, and we take a slightly longer tour and talk about nonlinear functions, only today, okay? And essentially we do it in a simple way. We don't talk about neural networks; they would make this whole story much more complicated from a mathematical point of view, and it's much less understood. We do it in a more classical way that keeps the math essentially the same, okay? We try to find the one way to make everything I said for linear functions go through for certain classes of nonlinear functions, okay? So that almost everything, up to a little bit of abstraction, remains the same. Before doing that, we ask a question about computation, which is kind of an appetizer, okay? So there's gonna be a little appetizer interlude, which is based on one observation. You remember, here, when you solve this — think about computation for a second. If you solve this, you're inverting a matrix that is d by d; but for the pseudoinverse we said, oh, we can actually pick the other one, the n by n one. Remember, I have two ways of writing it. Can I play the same trick here? Why would I wanna do it? Because solving this would be insane when d is much bigger than n, right? Forming this matrix is gonna cost me n d squared, and inverting it is gonna cost me d cubed. I don't wanna do that when this is a million and this is ten. What did we do before?
Well, you can read it in many ways, but essentially there is one nice identity, which is really simple, which is just this, okay? So let's see what I did. Let's convince ourselves that it's fine, and then I'll tell you how to prove it, and we won't do it. I took this matrix and I moved it in front, okay? What happened inside? We had to swap the order. Now, this makes sense from a pure dimensionality point of view, because this is d by n. So when I move it in front, it has to see something which is n by n, not d by d, and this is n by n, okay? So if you just check the dimensions: again, this is d by n, so when I move it here, it has to act on something of size n. So this has to happen. Why does it happen? How can you prove it? Well, if you just plug in the SVD, every time you see X or X transpose, you get a couple of identities, okay? You get this equal to that. So there's nothing going on; you can do it as an exercise. You plug in the SVD of X, and every time you see X, this is equal to that. It's easy to convince yourself. Now that we did this, we're back in business, because now, you see, all I have to do is invert this matrix, okay? So there are two different ways to write the same solution, but now all I have to do is solve this problem, where the roles of n and d are swapped, okay? So this becomes at most cubic in n, and only linear in d, okay? And then I have to store the whole matrix, okay? So on Wednesday we're gonna ask: how the hell do you do this when this is a million and this is a million, okay? Because clearly you don't do this. The problem is not even this, it's this: you don't store it anywhere, okay? So tomorrow and Wednesday we're gonna talk exactly about this — how you can even start to scale these things up, which clearly will not fit, not on your laptop, not on a supercomputer, okay? In the next few minutes I want to make a couple of observations.
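The identity can be checked numerically in a few lines (a sketch with made-up sizes and a made-up lambda): (XᵀX + λI)⁻¹Xᵀy equals Xᵀ(XXᵀ + λI)⁻¹y, so you are free to invert the d×d matrix or the n×n one, whichever is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 2000                  # many more features than points
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam = 0.1

# d x d formulation: forming X^T X costs n*d^2, solving costs d^3 -- insane here
w_big = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# n x n formulation: move X^T in front and swap the order inside;
# now the solve is only n x n, and the cost is linear in d
w_small = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)
```

The two vectors agree to machine precision; only the cost of computing them differs.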
The first one — and again, this is a bit of an appetizer — is that you can give an interpretation to this, which is yet another perspective on what you're doing. X is the data matrix; this is just a vector of n numbers. So I can rewrite this expression like this, and view the vector I'm looking for not as a generic vector, but as a particular vector which is a linear combination of the columns — no, sorry, rows, okay? So each row is d-dimensional, and the vector I'm looking for is not a generic vector: it is a linear combination of my data, with weights given by this expression, okay? It's a fact, it's trivial, there's nothing going on. Why would you care? Well, it turns out that this is useful for two reasons. The first one is this: it gives you a new interpretation of the algorithm. Plug this w into the expression for f; f is just the linear function with these coefficients. Now you can plug this expression in here, use linearity, and you get this. And now you get yet another interpretation of what you're doing. When you get a new point x and you want to predict its output, you're gonna take inner products with the data you already have, put the weights on, and this is gonna be your way to determine the output. I use the word similarity to describe the inner product, because after all, if the data are centered and norm one, that's just a correlation, okay? You're basically saying: to determine the output for this x, I'm gonna take the correlations with all my training set, weight them, and sum them up, and this is gonna be my output, okay? So the output is a weighted sum of correlations of the new input with the old inputs. It turns out that this simple expression will also be the key to get to what are called nonparametric models. Nonparametric models are the case where, all of a sudden, you work with infinite-dimensional functions, okay? It turns out that this one very simple expression here will be the key to get to infinite-dimensional models.
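This reading can be sketched in a few lines (illustrative data, not from the slides): solving the n×n system gives one coefficient per training point, w is the weighted combination of the training inputs, and predicting at a new x is exactly the weighted sum of inner products with the old inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
X = rng.standard_normal((n, d))   # rows are the training inputs
y = rng.standard_normal(n)
lam = 0.1

# One coefficient c_i per training point, from the n x n system
c = np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# w is a linear combination of the data, with weights c
w = X.T @ c

# Prediction at a new point: same number, read as a weighted sum of
# inner products ("similarities") with every training input
x_new = rng.standard_normal(d)
pred_w = w @ x_new
pred_c = sum(c[i] * (X[i] @ x_new) for i in range(n))
```

The two predictions coincide; the second form is the one that survives when the inner product is replaced by something richer.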
If I have time, I'm gonna show you why. The spoiler is that essentially you replace this inner product, which is a finite sum, with an infinite sum — a series that converges, okay? Anyway, keep this in mind, because if we get there, this is how you go to truly infinite-dimensional models, which are called nonparametric in statistics. Admittedly, this last part is a bonus, okay? What was really important for this class was everything up to now. I'm sure that by now your head is spinning a bit with some of these ideas, if you're not familiar with this. In this kind of class, you typically give students a bit more than what they can digest, and that's the part where I know I'm abusing you a little, but you can go back to it later on. [Student: in that first slide, w is a linear combination — but doesn't c depend on the data too?] Oh, because c depends also on this. Well, I only mean this, okay? I'm just saying, once you fit the coefficients c_i — so what he's saying is that c itself depends on X. What I mean is just that I can write w as a weighted combination of the data points, where the coefficients might themselves depend on the data, okay? I'm just saying: solve this linear system, you get a vector, you get that. That's all I mean, okay? You're right. Okay, so let's go beyond linear models. There is one thing that you do in one line, and one thing that takes two lines, okay? The first thing is to say: okay, what if from the outset I had told you that I was not working with linear functions, but with combinations of nonlinear functions? So instead of starting with this, we could have started here, and ask: what are these? You tell me — sines and cosines, exponentials, polynomials, you name it, okay? Give me one, boom, plug it in there.
But you could also say SIFT or HOG; or, if you like speech, MFCC; or, if you're looking at a specific problem where your friend knows that momentum, or energy, or whatever, is what matters, you can put that there, okay? There is a set of functions, and they need not be linear. So these functions are not linear, but I combine them linearly. From a function point of view, you're going from a straight line to a jagged, complicated thing; but from a mathematical point of view, this is a vector of p dimensions, and this is a vector of p dimensions too. So if I call it phi of x, I can write it like this, and then we can basically stop here. Everything we said so far is fine, as long as we preprocess the data and replace x with phi of x. So if you know these functions — if you can list them — you can just compute them on the data, and you're back in business. I just called x "x tilde"; I called phi of x "x"; but it's the same, okay? Everything else is an elaboration of this basic fact. We did nothing; we changed the name of something. So what you see is that you have to give a name to the data matrix after the preprocessing — I call it phi hat — and then you use it, and there's nothing going on, okay? So that's the first observation. If you're willing to pre-define your set of nonlinearities and combine them linearly, then everything we did works for nonlinear functions, and this gives you a lot of mileage, right? Because you can take a polynomial of degree gazillion-and-53; you can do that. You write it explicitly, you precompute it on your data, and boom, you're now working with thousand-dimensional polynomials, okay? And imagine what you can fit like this — imagine what you can do now. And imagine that now it's even more important to understand what happens when p is larger than n, because now I can generate p — that's what I hinted at the beginning: I just take products of coordinates, and then sines, and then cosines, and then whatever I want.
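As a minimal sketch of this preprocessing idea (the 1-d data and the monomial dictionary are my own choices, not from the lecture): pick a fixed dictionary of nonlinear features, evaluate it on the data to build the matrix phi hat, and then run exactly the same linear machinery on it.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 30)                     # 1-d inputs, for simplicity
y = np.sin(3 * x) + 0.1 * rng.standard_normal(30)

# Fixed dictionary of nonlinear features: phi(x) = (1, x, x^2, ..., x^p)
p = 8
Phi = np.vander(x, p + 1, increasing=True)     # the "phi hat" data matrix

# Same ridge solution as in the linear case, with X replaced by Phi
lam = 1e-6
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p + 1), Phi.T @ y)

# The fitted function is nonlinear in x but linear in the parameters w
residual = np.mean((Phi @ w - y) ** 2)
```

A straight line could never fit this sine, but the same linear algebra on the preprocessed data does.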
Okay? So the first step to go from linear to nonlinear is completely trivial. I just realize that I can linearly combine nonlinear features, okay? I really don't wanna talk much about this, because it's just the excuse for why I start from line one and talk to PhD students about linear functions: just by doing this, you immediately get a whole class of nonlinear functions, namely linear combinations of finitely many nonlinear functions, okay? Well, let's make it even a bit more interesting. You remember this? Let's do the change of notation. Whenever you see x, I'm gonna write phi of x. Whenever you see X hat, I'm gonna write phi hat, which is just the data matrix again — I have it somewhere here — where I apply phi to everybody, okay? That's it. Okay, I'll write it down and then we'll stare at it a little, okay? So this was the vector: it was the data transpose c, which means — in the sense we clarified — that it is a combination of the input points with some coefficients. But we also said, oh, I can plug it back here and I get this expression. And what I see here is that the expression depends on the inner product in this new feature space, okay? Fair enough. What about the coefficients? The coefficients depend on building a certain matrix which is still n by n — because, remember, this whole game was to get away from the large dimension and deal with just an n by n matrix. So this matrix is n by n. What's in each entry? Well, it's again this kind of inner product of each input with every other input I have. So the function depends only on this inner product, and the matrix I use to build the coefficients depends only on this inner product. There's nothing interesting going on here, okay? I just replaced the inner product with the new inner product after the preprocessing. Agreed? Now there's gonna be a bit of magic. Can I compute this whole thing if the number of features is infinite?
Wouldn't it be interesting? Because instead of committing to a thousand or a million features, I could go to infinitely many. But infinite doesn't sound like something a computer likes, okay? Yet everything here relies on the fact that I can compute this inner product efficiently, okay? I compute this inner product, I compute this inner product — if I can do it. Now, can you imagine a situation where this infinite series converges to a computable number? Series do that all the time, right? A series is a sum of infinitely many numbers that can have a finite value. So it's not inconceivable that an infinite sum gives me something where I don't wanna use this expression to compute it. I hope I can compute it regardless of that expression, because I don't wanna do an infinite sum, and I don't wanna truncate it either. I just wanna look at the value of the sum. And we know that there are series that converge, okay? Well, let's give an example. Beautiful — and that's exactly what you want after an hour and a half of somebody shouting stuff — but it's actually completely trivial. So start from the bottom, okay? If you take these crazy features here, it turns out that you can actually write down these expressions, and it's just a Gaussian, okay? Take this as one example, okay? The calculation in the middle is completely trivial. So let's just take it first as an example. I told you that there might be cases — we agree, hopefully — where I can put infinitely many features and still be able to compute. And here I give you an example. These particular features — I'm doing it in one dimension just because I'm lazy and the notation becomes a bit annoying — are in this case an interaction between monomials and exponentials, with some weights. That's it, okay? How did I choose them? Well, I actually worked by reverse engineering. I didn't start from the features; I started from the Gaussian and expanded the square.
I start from the Gaussian, I expand the square: I get an exponential here, I get an exponential here, and then I get the mixed product. And that one is a series: just write down the Taylor expansion of the exponential. That's it, okay? So I go up. There's nothing interesting in this calculation, right? You can do fifteen others. Every time you have a function of two variables and you can do a Taylor expansion, you can imagine that you have this. But then you don't wanna do that. You can start from whichever end you want, okay? You can start from these features and then ask: can I go to infinity? But you can also think of starting here. Now, what I say here is kind of black magic, okay? It's simple mathematically, but you still have to convince yourself, because all of a sudden you can go to spaces where the number of features is infinite. I do it this way because, from a procedural point of view, it's completely trivial, okay? All you do is say: you remember this expression? I found myself features whose infinite sum I can compute, and I called it K — for example, the Gaussian. Now I can write the function like this, and the coefficients like this. Or the kernel matrix, which is just this stuff here, okay? So I'm telling you more about how we do it than the deep reason why it matters. But the thing is, in 10 or 15 minutes we were able to go from linear functions, to nonlinear functions that are combinations of finitely many elements, to infinite combinations of nonlinear functions — as long as this converges, okay? Now, these are large spaces, okay? If linear functions form a small space, these are large, large spaces, okay? This space — an infinite-dimensional space of combinations of Gaussians — is a large space. You can actually approximate anything; it's dense. It's like a polynomial of infinite degree, okay?
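Here is the one-dimensional calculation as a sketch you can check numerically (the exact normalization of the features is my reconstruction, not copied from the slide): the infinitely many features φ_k(x) = x^k e^{-x²/2}/√(k!) have an inner product that sums to exp(-(x-x')²/2), so a modest truncation already matches the closed form, and in practice you never touch the series at all.

```python
from math import exp, factorial

def gaussian_kernel(x, xp):
    # Closed form: no infinite sum needed
    return exp(-(x - xp) ** 2 / 2)

def truncated_series(x, xp, terms):
    # Inner product of the monomial-times-exponential features,
    # truncated at `terms`: phi_k(z) = z^k exp(-z^2/2) / sqrt(k!)
    phi = lambda z, k: z ** k * exp(-z ** 2 / 2) / factorial(k) ** 0.5
    return sum(phi(x, k) * phi(xp, k) for k in range(terms))

x, xp = 0.7, -0.3
err = abs(gaussian_kernel(x, xp) - truncated_series(x, xp, 25))
```

The check works because exp(-x²/2)·exp(-x'²/2)·Σ(xx')^k/k! = exp(-x²/2 - x'²/2 + xx') = exp(-(x-x')²/2) — exactly the expand-the-square argument above.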
So now you can ask yourself what's not in there, because if linear functions are this big, this space is huge, okay? And we basically wanna stop this story here, okay? So we're gonna make three remarks, and then we're good. The first remark: you can now start either from the features or from the kernel — from this infinite sum — and ask yourself, okay, which ones are good, okay? If I found a K, if I found an expression, when is it good? It has to be symmetric and positive definite, and here I remind you what that means, okay? But again, this is just a glance; here are some examples, and you can do a lot of engineering out of these things. And here, really, this is just a peek, okay? Everything we said today, I think, has the nice property that it connects machine learning to a lot of different things, and here the list is: inverse problems, the one we mostly discussed; linear systems; and there is a whole connection with max-margin theory, the classical Vapnik statistical learning theory perspective. The whole last bit of math I showed you connects to a lot of stuff, okay? And here I just give you the names, because they're large topics: reproducing kernel Hilbert spaces; the Karhunen-Loève expansion in stochastic processes; Gaussian processes; and what in stochastic analysis are called Cameron-Martin spaces, okay? I just went quickly through them because this is not what this class is about. The take-home message for you is: if I work with linear functions, I get quite a bit of mileage toward nonlinear functions and even infinite-dimensional models, and there's a lot of math connecting this both to stochastic processes and to functional analysis. That's the take-home message of this last bit. I killed one slide, but I want to show you what's completely left out of my story. So again: you can consider a different loss function — I'm not gonna discuss this at all. You can consider a different norm — for example, a p-norm, or whatever you want.
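A quick numerical illustration of the symmetric positive definite condition (the example kernels are my own choices): evaluated on any set of points, a valid kernel must give a symmetric matrix with no negative eigenvalues, and something like k(x, x') = (x - x')² is symmetric but fails the second requirement.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(30)   # 30 random 1-d points

def gram(k, x):
    # Kernel (Gram) matrix: k evaluated on every pair of points
    return np.array([[k(a, b) for b in x] for a in x])

# Gaussian kernel: symmetric, and all eigenvalues >= 0 (up to round-off)
K = gram(lambda a, b: np.exp(-(a - b) ** 2 / 2), x)
sym_ok = np.allclose(K, K.T)
min_eig = np.linalg.eigvalsh(K).min()

# k(a, b) = (a - b)^2 is symmetric but NOT positive semidefinite:
# its Gram matrix has genuinely negative eigenvalues
K_bad = gram(lambda a, b: (a - b) ** 2, x)
min_eig_bad = np.linalg.eigvalsh(K_bad).min()
```

This is the practical meaning of the condition: it is exactly what guarantees that the n-by-n system you solve for the coefficients is well behaved.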
You can consider different function classes, and there we actually took some steps, and we took them quickly, because they are either very long or very short; so we went the very short way and said, okay, I could do this, where this is a nonlinearity, or I could do this, as long as I can compute this. What's left out? Well, we have nonlinear functions, okay? But we were nice to our math, and we only considered nonlinear functions that are parametrized linearly: the parameters they depend on enter the expression in a linear way. So we put the nonlinearity in here and left the w out. Same thing here: we put the nonlinearity on the x's, but we left the coefficients out, okay? Why? Because then, all of a sudden, lines give you everything. We're still working with lines somewhere, basically. I spent an hour talking about lines because, from an algorithmic point of view, whatever you work out for lines is still gonna work; just think of lines in very high dimensions, okay? What's left completely out is where you actually start from this and then trap the parameters inside the nonlinearity, okay? We didn't discuss this at all, okay? And what is clear is that the math there is gonna be different, because it's not linear math anymore; it's not linear algebra anymore, okay? You can ask yourself: can I still do this kind of reasoning? In some cases we know the answer, but the math — the derivation, the justification — has to be completely different, okay? So many of the things I said translate to that setting, but, as you might know, this is essentially the first step toward neural networks, right? You trap your parameters inside the nonlinearity, and then you iterate this: okay, w1 transpose — you put an extra w1 here, essentially, okay? And you keep on going.
We don't go that way, essentially because there's very little we know about it, so that's reserved for offline discussion; here I actually advocate the linear case as one setting where we can at least put some coordinates on our reasoning. Today was just a summary of one perspective — the classical regularization theory perspective — on building learning algorithms, revisiting classical empirical risk minimization, maximum likelihood estimation, call it whatever you want, okay? What we're gonna do from tomorrow is mostly deal with linear models — so essentially we're never gonna do this, we're not gonna do this, and we're never gonna do this; we stick to this — and we just ask ourselves whether there are different principles for building algorithms. The principle so far was — let me go back to the first slide, and then I'm finished — how did this whole story start? How did we build the learning algorithm in these two lines? Replace the objective function with the empirical one; replace the space of functions with a manageable one. Okay? Now, clearly, you can do it for other loss functions and other penalties and so on; you can still do this, right? You say, I take another loss function, and I do that. The computations are gonna be different, but I can do it. I take another norm, and I can still do that. Sure, I take another function space; you can do that. What we want to discuss is: okay, but is there another way to build a learning algorithm? We're gonna take the square loss and linear functions as an excuse to find new principles for building learning algorithms, and then they will most likely apply to all the other cases — other function classes, other norms, other losses — and so far we only know about a bunch of them, okay? Okay, so tomorrow this whole story will hopefully make a bit more sense. All right, I'm done. If you have any questions, I'm here.