OK, shall we? All right, so this is the last of the series of lectures. Let's jump right in, because I'm going to repeat a few things that we have already seen a few times. I won't remind you yet once more what the problem is, because by now I hope you remember. You've got data today, you've got to make predictions about the future, and what you can use is the data, but you have to keep in mind things like stability, fitting too much, fitting too little, and so on. And I like to think of this series of lectures as being about principles for how you build learning algorithms: not so much about how you design loss functions or regularization terms, but about how you can incorporate them into something that lets you keep at bay, at the same time, the statistical aspects, the stability aspects, and the computational aspects.

So the first day was devoted largely to classical stuff, like empirical risk minimization. Hopefully at least 50% of that material was already familiar to the majority of you; it's somewhat more the background material. And here there is a short summary. You have an objective function. You replace the true objective function with an empirical objective over some function class that you can actually compute. And then, possibly, you add constraints. That was the summary. The nice thing about this is that it separates almost completely the modeling and the computational aspects. You first replace a problem with another problem, and then you solve it. That's nice, and it's also limiting at the same time. Here, computations are treated separately, with a direct solver or something else, and the stability is governed by the regularization parameter lambda.

Yesterday, we spent quite a bit of time looking again at a simple case to introduce another idea. And hopefully the take-home message was that you can do basically the same thing just by running gradient descent. Hidden in the gradient descent is everything you saw on the previous slide, the modeling and the computations. So gradient descent, in this case, is not constrained, is not penalized. It just goes. And it doesn't go somewhere that you don't know: it actually goes to the minimal norm solution, in a way in which every step worsens the conditioning. So you know two things. You know that you go to the minimal norm solution of a linear system, and at every step the conditioning gets worse, which makes sense: you fit more, you're more accurate. But if the problem you're looking at has a bad condition number, you might eventually pay for that. So in the early iterations you are somewhat not that greedy about looking at the data, and you're more stable; but as you proceed, you start to potentially be unstable. This doesn't mean that you do get instability. I just remark this because, especially right now, there are many experiences with deep neural networks where you go very large scale on a classification problem and you don't see instability; you just see the error reaching a plateau. And that's not so surprising. If you have a classification problem which is, at its heart, close to separable, it's complicated because it's big, but it's close to being separable, that's what you expect. You somewhat reach the best separation between the two classes, perceptron-like, and then you just let it go.
Yet again, the way you reach it with gradient descent is somewhat gradual, in a way where you control the complexity. You can use this proactively as a way to build learning algorithms. You can also use it as a word of caution for whatever you were doing before. It basically says: be careful, because by optimizing you might already be doing everything. So if you design something like this, you solve it and you don't see overfitting, it's not so surprising, even if you put lambda equal to 0, because maybe it's the optimization below it that is already doing the job. That's something to know, because when you design an objective and then you do optimization, in many cases you're doing the same thing twice, so you have to be aware of this possible interaction. It's very hard, for example, in deep neural networks to turn off regularization completely, because essentially it would mean optimizing exactly, in some crazy way; the gradient descent that you actually run somewhat controls the complexity. OK, anyway, this was more or less the take-home message of yesterday's class. And we discussed a bit, or rather skipped over, the idea that you can do nonlinear extensions easily if you either use a set of basis functions, or nonlinear features that are not necessarily a basis, or kernels: that's for free. You can also do it for neural networks, but it's not for free; a lot of things that I said would have to be adapted a bit. So I would say this is roughly the take-home message from yesterday.

Now, a thing that I remarked a bunch of times is the idea that in this second algorithm it's hard to draw a line separating what we call statistics and what we call computation or optimization. It's all bundled up. And a manifestation of that is that the number of iterations can be seen as the regularization parameter; it behaves roughly like 1 over lambda, if lambda is your penalty parameter. The nice thing about this is that you have one parameter that controls at the same time the training time and the accuracy of your solution. So now you can ask: how much time do I need for this machine learning algorithm? The classic answer only looks at how big the problem is: oh, it's a million by three gazillion, so that's how much time you need. Now, it's still true that how big it is matters, but it also matters how nice the data are, or how hard the problem is. If the data are very noisy, it's very different than if the data are very clean. If the problem is remarkably simple, it's probably very different than when it is super complicated. And we saw yesterday that, in some sense, the fact that the training time becomes a regularization parameter suggests a different stopping rule, a completely different stopping rule, that basically says: look at the test error, or something like that, and use that as a way to stop. That's something you would never do just thinking about optimization. But if you think about learning, you would definitely do it. And in fact, people have been doing it for like 30 years. And I guess the other take-home message was that this idea is neither new from a practical point of view nor from a theoretical point of view, but somewhat the theory, the underlying understanding of this phenomenon, is not as widespread as its use. And so, in some sense, I was trying to convince you that this idea is actually as good as the other one.
This is not a case of one being the nice guy and the other the little ugly guy. You could start from either one, and I have a hard time saying which is better or worse. Now, what we want to discuss today, taking again a route that goes from classical to modern stuff, is the problem of memory. So here we said: training time controls statistics and computations; the number of iterations allows you to do that. The truth is that in most practical applications the main concern is actually space. So let's just look at the kind of objects we had to manipulate in the last couple of days. You've got the data matrix, and then a couple of square matrices that you can build out of it: the d by d matrix that you might want to call the covariance matrix, or the n by n matrix that you might want to call the Gramian matrix or the kernel matrix, depending on what you like. Now, think of n and d large. One can be much bigger than the other, but they're both very large. If one of the two is very small, you can play the usual trick and work with the other one. But if both of them are very large, it's not clear what to do.

What does it mean, very large? Well, do a back-of-the-envelope computation. Assume that you need a fixed number of bytes for each number, say 8 bytes for a double; just do a rough calculation of how much it costs to store a number, and then multiply by n times d. So if you have to store this matrix, you need n times d numbers. If n is 10 to the 4, the whole thing is going to be 10 to the 8. Then when you multiply by the memory you need for each entry, you easily get to the order of gigabytes. And what you see from this back-of-the-envelope computation is that a square matrix which is around 10,000 by 10,000 is typically what you can handle, slowly, with your laptop. And then you can say, OK, but my laptop sucks. But then you start to look at how much you can grow if you buy more memory, more RAM. First of all, it's very expensive. But even if you buy quite a bit, say 500 gigs or 1 terabyte of memory, it doesn't really save you, because this thing grows quadratically. If you go from 10,000 points to 50,000 points to 100,000 points, it's growing quadratically: every time you put a zero on one of those two sizes, you need 100 times more memory to store it. And so pretty quickly, when you get to sizes that are a million by a million, you cannot even ask yourself how well it works. It doesn't even make sense. You just cannot do it; you cannot run it. What you can do is, I don't know, throw away some data, which just doesn't sound like a great idea. So it's really a big issue; in modern machine learning this is becoming really the big issue. And of course, what we do in practice oftentimes is resort to much simpler learning rules, things like nearest neighbors, for example, because you can compute them: you can somewhat split up the computation of the distances of everybody against everybody. You know it's not going to be very sophisticated; it's local and doesn't take any global structure of the data into account, but at least you can try to approximate it well. So what we do want to discuss today is how to deal with memory.
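To make the back-of-the-envelope calculation concrete, here is a tiny sketch (the 8-bytes-per-entry figure assumes double precision; the helper name is just illustrative):

```python
# Rough memory cost of storing a dense n x n kernel (or d x d covariance) matrix,
# assuming 8 bytes per entry (float64). Purely a back-of-the-envelope helper.
def dense_matrix_gib(rows, cols, bytes_per_entry=8):
    return rows * cols * bytes_per_entry / 2**30  # size in GiB

for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: {dense_matrix_gib(n, n):>12,.1f} GiB")
# n =    10,000:          0.7 GiB   (a laptop can cope, slowly)
# n =   100,000:         74.5 GiB   (a big server)
# n = 1,000,000:      7,450.6 GiB   (forget about storing it densely)
```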
And ideally, at the end of today, we would like to have the feeling that we can get the same kind of miracle that happened with gradient descent, where time controls statistics. Here we would like to have space controlling time and statistics. We would like a memory parameter that controls everything at the same time: how much space you need, how much time you need, and how well you predict future data. So that's the promise for today's class.

So how do you go about trying to fix this problem? The obvious way is what you might call random projection, or dimensionality reduction, or whatever you want to call it. You take the matrix, n by d, and then you multiply it by another matrix. I call it S, because sometimes this is called a sketch. And the point is that this matrix is d by m, where m is typically smaller than both n and d. So you're effectively starting from a big matrix and you make it skinny. You keep all the data, but you reduce their dimension. You can view it as building a representation which is shorter, just by multiplying: a linear representation that makes the data lower dimensional. If you want, you can think of it as a projection.

So that's the first observation: you lose information, right? The obvious answer is yes. The more subtle question is: do you? The spoiler, as you'll see in a minute, is that you always lose information, and you always want to lose information. What do you do when you do ridge regression? You put in the parameter lambda, and you say, more or less out loud, I don't care about small eigenvalues; you just kill them, right? So the natural observation is: I'm going to lose information. The more subtle question is: is it going to be useful information, or is it the same kind of information that I'm going to throw away anyhow once I do ridge regression or gradient descent? Gradient descent, if I stop at iteration 100, is effectively not looking at a bunch of eigenvalues. So will this do something similar? Will this somewhat throw away the information that I knew I was going to throw away anyway? That's the question. And again, I'm giving you clues of what's going to happen.

This is the matrix-wise version: I take the whole matrix and I reduce the dimensionality; X hat m is the matrix with the reduced dimension. Here is the same thing for just one row of the matrix, just one point. Notice that, strictly speaking, a projection is something slightly different: when you do a projection, you take the inner product and then attach that coefficient back to the vector you projected onto, so you stay in the big space. Here we just want to reduce the dimensionality; we don't want to live in the big space, we want to work in the subspace of dimension m. But if you don't know what I'm talking about, then you don't care about my remark; it's fine. So that's the basic operation. Everybody happy with this? OK, fair enough.

But I haven't told you how I choose S. I just told you that if I choose S, then I can do stuff. What can I do, for example? I can go in and plug S into empirical risk minimization; we want to mix this idea with empirical risk minimization today. So I can just take my data and plug them into the usual thing. And I take a linear function here, but now the weight vector is going to be in R^m, and the matrix I carry around is going to be n by m. So if you remember any of the computations that we did, they all get scaled down: the key quantity is going to be m rather than d or p or whatever you called it. A minimal sketch of this pipeline, for a generic S, is below.
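Here is that sketch in code: project with a given S, then do plain least squares in the reduced space (the Gaussian choice of S and the synthetic data are placeholders, since the lecture has not yet committed to a particular S):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 500, 50                        # n points in d dimensions, sketch size m
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((d, m)) / np.sqrt(m)   # a generic d x m sketching matrix
X_m = X @ S                                    # the "skinny" n x m data matrix
w_m, *_ = np.linalg.lstsq(X_m, y, rcond=None)  # empirical risk minimization in R^m

x_new = rng.standard_normal(d)
y_hat = (x_new @ S) @ w_m                      # sketch the new point, then predict
```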
All right, so how do you choose the projection? Well, we're going to discuss for a little bit, because it's pedagogically interesting, the natural choice. The natural choice is what we actually mentioned a few times in the last classes, which is PCA, or the SVD. Again, if you give me a set of vectors and you ask me what is the best m-dimensional representation, and you can say either the one that preserves the variance of the data or the one that best reconstructs the data, then the best thing you can do is diagonalize the matrix, do the singular value decomposition, and keep the first m components. That's what we do here. This is what we introduced in the first class; this is what we've been using to sometimes write down the solutions more explicitly. So V is a basis of R^d, U is a basis of R^n, Sigma is a diagonal matrix, and its size is the rank of the matrix, which is at most the smaller of n and d, but potentially smaller if there is some degeneracy. So what do you do? You say, OK, let me take now not all the basis vectors, otherwise I just reconstruct the data exactly: I can express any row using all of them, I get an identity, it's a basis. What I want to do is truncate this to m. I only take m of the columns of this matrix, which are elements of the basis, and I use this matrix, call it V_m, as my sketching matrix, my projection matrix. So I define X hat m like before, but now I tell you how I choose the matrix: I choose it like that.

So the question is essentially how to choose m. Let's keep that in mind. How do you choose m? At this level of generality, we haven't put any target into the game; we just said we want to reduce the dimension. How many of you know what PCA is? That's good. So if you know PCA, one classical way to do it is this: PCA gives you a measure of how much variance you keep in the data. So one way is to say, I just look at the inputs, I declare that I want to keep a certain percentage of the variance of the data, and then I choose m so that I achieve that percentage. That's one possible way. We're going to question whether this is a good idea, or at least try to understand whether it is. Similarly, if you think in terms of reconstructing the data, you can write down the same kind of criterion. So one possible way is to put a threshold on the amount of variance you want to explain and take the m that gives you that. We'll see today that that's not what we want to do here. So, first of all, m is a free parameter. One possible way is to choose it in an unsupervised way through the variance, but we want to keep it free for now and see if there is another way to choose it.

So let's go on a little bit and see what we can do with this. One thing you can observe is that, with a little bit of linear algebra, you can also make PCA nonlinear. Let me take a couple of minutes to say this. If your data lie pretty close to a line, then PCA is basically telling you: just pick this direction. But if your data are more like a banana, PCA is going to tell you that you have to keep two directions, because it's trying to find a linear thing, and clearly you cannot fit a banana with a line. On the other hand, at least in this simple case, it's clear from visual inspection that this is indeed one-dimensional; it's just curved. So one idea could be: what if we actually take these points and lift them to a space by taking polynomials of degree 2 of the entries? So we take quadratics.
We're going to take the first dimension squared. So say a vector x is (x1, x2), and what you can do is take x1 squared, x2 squared, x1 times x2. Does that make sense? And then you call it phi of x; you can call it a feature map. It's just a set of features for each vector. I'm doing a simple case: this vector is two-dimensional, and I map it. Now the idea could be: what if I now do PCA here? Basically, I move the data into a space where there is an extra third dimension; these are the three new coordinates. And there I'm going to try to find a plane. But when I look back at that plane in the original world, it's going to be a curved line. Does that make sense? It's the same trick we used before: using something linear to find something nonlinear. Here we just use it in the unsupervised setting.

And the only thing, perhaps, that you have to remember is what I mean when I talk about PCA finding lines, just to be sure that I don't lose you here. We play the same tricks, but before, this was x and this was y. Now everything is x. We're not talking about a function that predicts the data; we're talking about a linear space that fits the data. I have a cloud of points, there are no classes, and I just have to find a piece of paper that gets close to them: a subspace, a linear subspace, that is close to everybody. Does that make sense? The picture could have been misleading, because we drew this when this was x and this was y; here everything is x, there is no y anywhere in the picture. And when you do PCA, when you find these eigenvectors, that's what you're doing: you're finding a linear span of vectors, which you can think of as a plane or a line or a flat surface, that is going to be close to your cloud of input points. And all I'm saying here is that if your cloud of input points is close to something flat, classical PCA is good enough; if your cloud of points is close to something which is curving, then you cannot use PCA. Or you can, but you're going to get just a first-order approximation. But you can try to use the same trick, using nonlinear coordinates, to somewhat get a curved surface while still just solving an eigenvalue problem, because the idea is that once you have this, you do PCA on the new matrix after the pre-processing. This is roughly what is called kernel PCA. Kernel here is a bit of a stretch, because so far you still just use features. So the question is whether you can actually use kernels.
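To make the mechanics concrete, here is a toy version of the banana example: build the degree-2 features, then run ordinary PCA (an SVD) on the feature matrix. The specific data, the noise level, and the choice of keeping only the pure degree-2 monomials are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, 500)
X = np.c_[t, t**2] + 0.02 * rng.standard_normal((500, 2))   # points near a parabola

def phi(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.c_[x1**2, x2**2, x1 * x2]   # the degree-2 feature map from the board

Z = phi(X)
Z = Z - Z.mean(axis=0)                    # center before PCA
U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
print(sigma**2 / np.sum(sigma**2))        # fraction of variance kept by each feature-space direction
Z_m = Z @ Vt[:1].T                        # keep m = 1 direction in feature space
```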
Yes? In that space? Oh, no, you don't. Yeah, yeah. No, I'm not making that statement. What David is asking is how you can be sure that, once you make this choice of features, you really get it right. I cheated here, right? I wrote down the quadratics and then gave you the right solution. In practice, you cannot look at the high-dimensional space; you hope for the best. But it's the same thing we did for supervised learning: that's where you have to put the prior information. What did we do when we saw this before? You don't know the shape of the function, and you try to dump into the set of features everything you believe can get you close to it: sines, cosines, prior information, whatever you know, and then you hope for the best.

And you just know that you have a bit more room for getting a rich model than you had a second ago, when you were limited to linear functions. That's all I'm saying. So of course, as in supervised learning, at some point you have to commit to a representation, to a set of features. If it's good, you can use it and get a good solution; otherwise, you don't know. There's no magic here. You cannot look into the high-dimensional world, that's for sure. The take-home idea is just: you remember the trick where we went from linear to nonlinear? You can also do it for PCA. That's all. It's the same story, with all the benefits and the limits. The limit is that you have to pre-specify the features; the benefit is that once you have them, you can play the game.

So we want to move on and see if we can actually use kernels. You remember what the trick was with kernels? We tried to show that all the computations needed only the inner product of two vectors. And then we said, well, if everything depends only on products like phi of x transpose phi of x prime, then I might be able to find cases where I can let p go to infinity and still be able to compute everything. So what we want to do now is to show that when you do PCA, all you need are inner products; you never need to handle a vector alone. And this will open the way not only to using features, but to using kernels, and so to a non-parametric version of PCA. Is the question clear? I just want to show you in one slide, because it's really easy, that by using basic properties of the SVD, we can show that when you compute an SVD you don't actually need to work with the vectors themselves; all you need are inner products. Let me show you how.

There are three steps. Before you start getting lost in slides that contain too many things, let me tell you what's going on. The game is essentially: start from the basic SVD expression, use it to derive an expression for V, then plug it into the sketching formula and see what happens. So those are the steps. The first step is just to look at this: if X equals U Sigma V transpose, then I can also use it to write the transpose of this matrix. This is standard. What you see is that this one is n by d and this one is d by n, so in some sense you switch U and V, and that's the new expression. Now, out of this, can you derive an equation that tells you how V can be expressed in terms of everything else? Well, let's see. X transpose is V Sigma U transpose. If you multiply this equation on both sides by U, since U is an orthonormal basis, when these vectors hit themselves they give either 1 or 0; you get an identity. So if you multiply here and here by U, a U appears on one side and the U transpose disappears on the other. Then, if you want to get V as a function of everything else, you still have to invert Sigma. But that's just a diagonal matrix, so the inverse is trivial: V equals X transpose U Sigma inverse. I'm just rewriting the SVD in a different form. Now, what you want to do when you use this for sketching is that you don't want to use the whole thing.
You just want to use this up to column m. So you take it, you plug it in, and then, don't be fooled, this is n equalities, right? It's a matrix equation, and every column satisfies an equality; I just have to take the first m of them. So you take that expression for V, you plug it in here, and then X is going to see an X transpose. I literally take this and plug it in here. Are you with me up to here? Fine. I just take that expression, plug it in, and stop at dimension m.

But what is this? Can I use these two expressions to write this matrix? You can, right? Because you basically have to multiply this by this: the V transpose is going to see a V, and they squeeze into an identity, and then a Sigma is going to see a Sigma, and they become a square. So inside here you're going to get U Sigma squared U transpose; let me write it. So X hat m is X X transpose times U_m Sigma_m inverse. But just by plugging in the SVD, X X transpose equals U Sigma squared U transpose. And roughly speaking, what happens is that when this guy sees the U_m, everything gets squeezed onto the m-dimensional subspace, and then the Sigma squared sees the Sigma_m inverse and becomes Sigma_m, period: there is a square times a minus 1, and you're left with one power. So the final result is X hat m equals U_m Sigma_m. I don't want to spend too much time on this; there's nothing deep going on, it's just basic algebra. I'm just sketching the steps, and the specific details are something you can work out yourself. That's not what matters here. What matters is what happens when you look at it a bit more.

So the take-home message: up to now, we took X X transpose, did the SVD, rewrote the V that gives us the sketched image, plugged it in, and got this expression. And now we can use it inside this equation: we have to multiply X transpose times v_j. That's the difference between this line and this line: one is the projection of the data set itself. If I have to project the data set itself, I only need to look at this quantity. If I want to project any new point, this is what I have to do: I go back to this expression, plug it in there, and use it directly. And this is what you get. The second equation over there is for X transpose, so in that case I'm projecting the data again. Here instead I take a generic point, and I multiply by this v_j. But v_j is just X transpose u_j times 1 over sigma_j, where the sigma_j are the singular values. So again, it's the same equation as on the top right, just for one vector. Plug it in here, and this is what you get. So when you project a single point, you take the inner products between this point and all the training set points, then you use the entries of these eigenvectors as coefficients and divide by sigma_j. Again, there's nothing interesting going on here; it's completely elementary. All you need to see at this point, if you want to derive it, is to use the SVD, plug it into a couple of places, and see what comes out. But now you can stare at this and see what we found.
What is U_m? It's the matrix of the first m eigenvectors of which matrix? This guy. How big is this guy? Well, it was n by d times d by n, so the whole thing is n by n. And what is it? It's just the inner products of everybody against everybody. So to compute U_m, you only need to build this matrix, the matrix of inner products, and diagonalize it, agreed? This is the thing you should remember: the whole game was to see that if you want to project the data, all you need is to diagonalize the Gramian, the kernel matrix of your data, the inner product matrix. What if you have to project a new point? Well, then you compute this expression. And what is it? You compute the eigenvectors and the singular values of the Gramian matrix, which is n by n, and then you evaluate this expression, which again only involves inner products. That's it. We're done. Because now, if I tell you I want to work in an infinite-dimensional space, what can you say? You take a phi which is infinite dimensional; this matrix only depends on inner products, this expression only depends on inner products, so you can use a kernel, and even if p is infinite you're good. The recipe would be: take the kernel matrix, diagonalize it, and use these formulas. I'm sorry to impose some calculation on you in the morning, and I probably spent too many words on this; there's nothing going on here, it's just yet another exercise. You take the SVD, you plug it in a bit, you get this stuff. And the main thing is that you just look at it and you're like, oh, I'm happy, I can put inner products everywhere, and if I like kernels I can plug them in. Which means, going back to David's question, that you still don't know whether your choice is right, but you can try to get a very large function space, say the space associated with Gaussians, which is a very rich space and can, in principle, linearize almost everything. So it gives you some room to try to get the right answer.

And that's just the new notation, once you want to see it this way. If you use features, this is what you do: you just replace, remember, these matrices here were just the data matrix after the pre-processing of computing the nonlinear features. That's the case where you work with features. But if you now work with kernels, then you just have to compute this, and don't be fooled: you don't need phi to do this, or to compute this expression here, where these are the entries of each eigenvector. To compute a single number, a single projection onto one dimension, you compute this sum over the coordinates of that eigenvector; the sigma_j are the elements of the diagonal, and these are the coordinates of the j-th eigenvector of this matrix.

So that was our little interlude: with really little work, you can actually do nonlinear PCA with features and kernels. And you can view it as some form of autoencoder, if you love to talk about autoencoders: some kind of nonlinear autoencoder.
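Before plugging the squeezed data into supervised learning, here is the recipe just described as a minimal code sketch: diagonalize the Gram (kernel) matrix, then project any new point using only kernel evaluations. The Gaussian kernel, its width, and the omission of centering are illustrative simplifications:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))            # training inputs
K = gaussian_kernel(X, X)                    # n x n Gram matrix
evals, U = np.linalg.eigh(K)                 # eigh gives eigenvalues in ascending order
evals, U = evals[::-1], U[:, ::-1]           # reorder: largest first

m = 10
sigma_m = np.sqrt(np.maximum(evals[:m], 0))  # singular values sigma_j = sqrt(eigenvalues of K)

def project(x_new):
    """Project a new point onto the first m kernel-PCA directions."""
    k = gaussian_kernel(x_new[None, :], X)[0]    # k(x, x_i) for every training point
    return (k @ U[:, :m]) / sigma_m              # eigenvector entries as coefficients, divided by sigma_j

z = project(rng.standard_normal(5))              # m-dimensional representation of a new point
```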
All right, what do you do with it now? We've been talking about supervised learning for three days, so we want to at least see what happens once you get labels. And as I said, the obvious way of using this is: you give me labels, and I use them, but not with the original data; I use the squeezed data. So you give me the vector of outputs, and I use it with the new matrix. That's it; that's the simplest thing you can do. At some point you might worry: should I further regularize? Hold that thought for a minute. But this part we've discussed plenty of times by now, so I don't have to go through it again; we can just spit out the answer. Just for the sake of simplicity I'm assuming that this guy is invertible, so that I don't fiddle around with pseudo-inverses. After all, I made m small; this is m by m; let's assume for a minute that it's invertible. Fair enough.

We look at this. Does it look familiar? Maybe, depending on how much you've looked at SVDs in your life. If you've thought about the SVD enough, it actually looks quite familiar. But if you come from the attitude of "I do dimensionality reduction because I do machine learning, that's what we do, and I do kernels and nonlinear stuff," then it might not look so familiar. So let's massage it just a teeny tiny bit and see if it looks like something we've been talking about extensively for two days in a row. You know what, I'm going to skip some of this because I don't want to do the same thing over and over again. But now you could just ask: can I go in and plug in here the expressions we found in the previous slides? I gave you all these nice tools: X can be written like this, X transpose like this, X transpose is on the board, this guy can be written like that, and so on. So you can go in, plug all this stuff in, and see what happens. I'll tell you what happens: you get this. There is an inverse here that comes from the fact that I'm inverting this and multiplying by this. For those of you who like to do calculations just by looking at them: you get squared singular values to the minus 1 here, but you get a singular value there, so it's a sigma to the minus 2 and a sigma, and you expect a 1 over sigma to pop out. And this vector is size n and this vector is size m, so you have to have different matrices on the two sides. So just out of this blabbering, the result kind of makes sense, and if you do the computation, that's what you get.

What is it? You take the outputs, you project, you take the inner product, then you multiply by that. This x is a distraction at this point; it's just there to remind you that this vector is of size m, and to get the prediction you have to multiply by the projected point. So if you plug everything in, this is the expression you get. And what is it? Well, it's that thing that, whenever I talked to you about killing eigenvalues, you asked: can I do that? The thing where, if the eigenvalue is small, I just throw it away. Yeah, that's it. That's exactly it. So I could have told you this whole story just by saying: take the matrix and kill the small eigenvalues. Doing PCA and then least squares is the same thing. And it's not that surprising; if you think about it for a minute, it's kind of obvious. But if you come from the perspective of what you typically do, and often what I do, which is: I project the data, and then I do whatever I was going to do afterwards, then this is telling you: be careful, because when you project the data, you're already doing those other things that we discussed for two days.
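Here is a quick numerical check of that equivalence on synthetic data: "project onto the first m principal components, then run plain least squares" gives exactly the truncated-SVD solution (the data are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 50, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
V_m = Vt[:m].T                                   # d x m projection onto the top m components

# Route 1: principal component regression -- project, then unregularized least squares.
X_m = X @ V_m
w_pcr = V_m @ np.linalg.solve(X_m.T @ X_m, X_m.T @ y)   # mapped back to R^d for comparison

# Route 2: truncated SVD -- keep 1/sigma_j for j <= m, kill everything after.
w_tsvd = V_m @ ((U[:, :m].T @ y) / sigma[:m])

print(np.allclose(w_pcr, w_tsvd))   # True
```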
So rather than thinking of this as filtering, you can also think of it as dimensionality reduction. And now the statement is: dimensionality reduction, used as a pre-processing step in supervised learning, has a regularization effect. And in this simple setting it's not just any regularization effect; it's the exact same kind of regularization effect as Tikhonov regularization, ridge regression, however you want to call it, or gradient descent. Just with a slightly different flavor: it's again a low-pass filter, just with a slightly different shape, the one that kills eigenvalues outright. Makes sense? And this is the summary of the story. This method is well known in linear algebra, where it's called truncated singular value decomposition; we discussed it quite a bit, and you would come up with it because it's the obvious way to do a certain thing. But now notice that I can see it in a different way: I can think of doing PCA and empirical risk minimization, one after the other, in a pipeline. And it also has a name in statistics, because the name game is awesome: it's called principal component regression. It's a 1936 algorithm. You can view it as a filter function; this is the filter function I wrote here explicitly: when j is at most m, you keep 1 over sigma; when j is bigger than m, you set it to zero. You can compare it to this filter, there is a sigma squared missing here, sorry, which is the Tikhonov one, and to this one, which is the gradient descent one. So you can compare them and see that they're roughly doing the same thing.

And now we can go back to your question: how do you choose m? Anybody? Sure, you can use the variance. Or you can do that. All these answers you could have given me before this slide, right? But in this slide I'm telling you something that is the clue: these three filters are roughly the same. They all depend on one parameter. So let me ask you another question: how do you choose the stopping time? You remember? What do you look at? Do you look at eigenvalues? No. Do you look at a threshold? What do you look at? The validation error, the test error, right? So why, all of a sudden, here do you want to look at eigenvalues? Who cares? All you care about in supervised learning is the test error. Maybe you have the eigenvalues and you can look at them, and there's nothing wrong with that. But the way you choose lambda, the way you choose t, the way you choose m should, in this view, be roughly the same, because they're roughly doing the same thing. If the eigenvalues help you and give you further diagnostics, use them; I'm not saying that's illegal. I'm just saying that this is the obvious thing to do here, and everything else is a distraction, because we came from a narrative that was suggesting another story. We came from "project the data, blah, blah, blah." But when you get here, you're just saying: I apply a filter, and I decide to use this window rather than that window, and instead of lambda I now call it m. That's it. It's the same. So just do whatever you did before: take your data, split them in two, try different m, and pick the m that is best on the validation set. And this is not going to work great; it won't, just like all the other model selection criteria. But it's a supervised criterion. Suppose, for example, that to predict your output you actually need to keep a much higher fraction of the variance of your input than you thought. How do you know, until you look at your output? You have to look at the test error, and that's it.
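For reference, here is one standard way to write the three filters side by side, in the usual spectral-filtering notation (a sketch: the gradient-descent step size $\gamma$ and the exact normalization of $\lambda$ are assumptions, since they are not spelled out in the transcript). Each method builds its solution as $w = \sum_j F(\sigma_j)\,(u_j^\top y)\, v_j$:

```latex
F_{\mathrm{pcr}}(\sigma_j) = \begin{cases} 1/\sigma_j & j \le m,\\ 0 & j > m,\end{cases}
\qquad
F_{\mathrm{ridge}}(\sigma_j) = \frac{\sigma_j}{\sigma_j^2 + \lambda},
\qquad
F_{\mathrm{gd},\,t}(\sigma_j) = \frac{1 - (1 - \gamma\,\sigma_j^2)^t}{\sigma_j}.
```

All three behave like $1/\sigma_j$ on the large singular values and suppress the small ones; $m$, $\lambda$, and $t$ are just three different knobs on the same cutoff.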
So the take-home message of this is just that PCA, when used in cascade with simple, unregularized least squares, is doing the same thing as regularized least squares. The same thing. And the regularization parameter, the one that controls the precision of your approximation, becomes the number of principal components. So if you ask yourself how to choose it: don't fiddle around with explained variance; look at the test error. That's it. So at this point, hopefully, if you had these two separate places in your brain, the dimensionality reduction region and the regression region, you can now connect them: there is a perspective under which they're the freaking same thing. When you project your data, well, I just told you that when you optimize, be careful, because you might already be doing what you thought you were going to do afterwards. I take a penalty, and now I optimize; oh, wait a minute, this sounds like I'm taking a penalty and then taking the penalty again, or optimizing and then optimizing again. And here I'm telling you that you can replace "optimize" with "project": I project, and then I regularize. Oh, wait, that sounds very similar to "I project and project," or "I regularize and regularize," because there is a sense in which all these words are somewhat equivalent. You can still put them to good use; for example, if you project linearly like this and then do some nonlinearity afterwards, that's a different story. But this, simply, and I'm not trying to say more than what I'm saying: inspect this, and you clearly see a set of equivalences. And you have to keep them in mind, because you might be projecting your data, writing down the objective function, then optimizing; so you regularize, you regularize, and you regularize. You just want to know that you're potentially doing that, so you avoid saying, oh wait, I don't see overfitting. I hope not: you did the same thing three times; at some point you probably did a good job.

Fine, but this whole story was about trying to reduce dimensionality, and now the story is bittersweet in a way. Let's cheat and start from this slide. What's the cost of doing this? You remember the cost of doing it without the m? It was: build this matrix and then invert it. So the cost was n p squared plus p cubed in time, and n p to store it. This was time, and this was memory. What's the cost of the new one? Did we throw away data? No, we don't want to throw away data; we want to keep them. Did we reduce the size? Yes. Wherever there was a p, we now call it m. So it's the same thing with p replaced by m; whenever you see p, put an m. And at this point the deadline for your conference is tomorrow, and you say, it's amazing, I did it. But then by 11 PM you think again and you realize that you have to compute X_m. If you have X_m, you're happy and you can submit your paper. Unfortunately, to compute X_m you have to diagonalize this thing here. And you know how much that costs? Like this, with the p inside. So if God gives you the PCA, you're in the game: you can do your filtering, you can do whatever you want. But if God is not your friend, that's a problem, because now you're back to square one. We discovered that projection regularizes, and that you can do it linearly and nonlinearly; that's why we did this detour. But with respect to the question we asked on the first slide, we made zero progress.
We still have an algorithm which is essentially equivalent to the one we had, and it has essentially the same cost as the one we had, because the real cost is not this part, but this part: not this stuff, but this stuff. And for the nonlinear case, it's the same story. So PCA is nice, but it costs too much. It would do the job, but up front you have to do something which is as costly as solving the supervised problem you have to solve later on. So, at least in this setting, it's not obvious why it's worth doing. You do get the nice fact that you can, if you want, interpret m as a dimensionality; you couldn't do that for t or lambda, and it's much easier to think of m as a dimension, the size of my data. You can go back and connect the variance and the test error. So you get these benefits. But with respect to the problem of saving memory, we made not much progress.

So what can you do? You would like to somewhat do PCA without doing PCA. Well, that's not easy, but that's the idea. What we're actually going to do is something quite brutal: we're going to replace the PCA matrix, the eigenvectors, with something a little bit faster to compute, a set of random numbers. So S appears in the same equation, but now S is a matrix of random numbers; in the simplest case, just Gaussian entries. You draw a random d by m matrix, you produce this matrix of random numbers, and then you use it. Compared to PCA, I guess we all agree that this is a bit faster: you don't have to do an eigendecomposition, you just have to generate a bunch of random numbers. That's much quicker. Now, it's not obvious why this should work, because to some extent you just replaced a very nice choice, the one you know preserves the variance and so on, with a set of random numbers. So it's not completely clear why this should work. Of course, at this point the computational problem is gone, because you still have the miracle of dropping the dimension down, but you no longer have any overhead to compute S, like you did with PCA. This is for free. So now you have the nice side without the cost.

So why would this be a good idea? Why could it ever make sense? Well, a clue comes from looking at things in expectation. Take the projected data and look at the expectation of the inner product. Inner products and distances carry basically the same information, right? So if you like, think of this calculation as computing distances. And the idea is: it's true that we reduce dimensionality and we lose something, but this is a random dimensionality reduction, and we would like to see that, on average, we don't lose much. So if you can show that you preserve distances in expectation, you're happy: it means that, on average, you're not doing any damage. Agreed? If I can show that my random dimensionality reduction, which shrinks the size, preserves distances on average, it means that, on average, it's possible not to do too much damage. Well, let's see. We look at inner products rather than distances, just because it's easier, and you can recycle the result. So this is what you get: you have to normalize the inner product, but that's not a big deal. You take this, you plug in the expression, and you see an S S transpose popping up here. Now, by linearity, you can push the expectation where it belongs, because the only thing which is random here is S, not X; X is fixed.
And then you get this expression. But what is this? A matrix of random Gaussian numbers multiplied by itself, and the entries are all independent. So this is exactly like taking a multivariate Gaussian and computing its second moment: these are zero-mean Gaussian vectors, so this is just the identity. It's a simple calculation; all you need to remember is that the covariance of a standard Gaussian vector is the identity. And boom: all of a sudden the stuff in the middle disappears, and all you get is the original inner product. So this is a hint that there could be a saving grace here, because we're projecting at random. But random here is not random as in "our data are random." Our data are random with a lot of structure: they don't go everywhere, they're not just spread over random directions; they align with certain directions, and x and y are governed by certain laws. This is random in the isotropic, uniform, agnostic sense: it goes everywhere. And because of this, because we project a bit in every direction, on average it doesn't do any harm. So the fact that we take random projections, which are isotropic in this sense, could be the saving grace. And it turns out it is, on average: if you did this many, many times and averaged the results, it would do a good job. The problem is that you don't want to do that, because then you would have to pay an extra cost and do it many times. But this at least tells you that it doesn't sound like such a crazy idea. Yes? I'm assuming that everything is independent of everything else here, so I can use that. They're all independent, so you can expand this over all the entries, everybody against everybody else. When an entry meets itself, it gives you one; everything else is independent, so when you take the product of two independent zero-mean numbers, the expectation is zero.

All right, so for those of you familiar with it, this is not unrelated to the Johnson-Lindenstrauss lemma; that's basically what this is about. What we want to do is use it. And the basic intuition is that this doesn't sound crazy, but we might get extra variance, because the result is not always right, it's only right on average; there are some bad cases where the projection is bad, and I might actually lose quite a bit. For this reason, typically, you still have to add the lambda. Now you're going to have a lambda and an m, both appearing in your equation, and they both somewhat play a role. Again, this becomes your matrix, and again, in expectation you get the right quantity, but you have some extra variability to take care of, and that's why you still add this extra parameter. In some sense, with PCA you know that you're perfectly aligned with the first, second, third, fifth, sixth eigenvector, by construction. Here you have a random situation where this guy can point in the wrong direction: it can lean towards the small eigenvalues a bit more. On average it doesn't; on average it roughly gives you back the right matrix. But the intuition is that there are bad cases in which you end up exploring smaller eigenvalues than you would wish, so you still have to add a bit of regularization. That's what I'm talking about.
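Here is a quick Monte Carlo illustration of the "right on average, but with extra variance" point: with i.i.d. Gaussian S, the rescaled sketched inner product matches the original one in expectation, while individual draws scatter around it (the dimensions and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 100, 20, 5000
x, xp = rng.standard_normal(d), rng.standard_normal(d)

vals = []
for _ in range(trials):
    S = rng.standard_normal((d, m))          # fresh d x m Gaussian sketch each time
    vals.append((S.T @ x) @ (S.T @ xp) / m)  # rescaled sketched inner product

print(x @ xp)          # the true inner product
print(np.mean(vals))   # close to the true value: correct in expectation
print(np.std(vals))    # the extra variance a single sketch has to live with
```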
You have the variance of the inner product, and that variance gives you an extra cost you have to be stable against. Or you can view it from a spectral point of view, by saying that this same variance is the thing that can sometimes point me towards small eigenvalues. Now, if you do this, you end up with this equation: it's the same equation as before, but now you get an extra lambda. And notice that at this point you can decide what you want to call the regularization parameter. Let me give you two stories. One story is: I do ridge regression, but I do dimensionality reduction first. Then the dimensionality reduction becomes somewhat of a memory-constraint parameter, and lambda is the regularization parameter that controls the statistics. That's the most canonical and classical story. But let me give you another story, which follows what we discussed today. I want to do regularization by projection, à la PCA, so m is actually controlling the statistics. But I might hit some bad cases, so I want to do a bit of preconditioning: I add a little safeguard against small eigenvalues. In this view, you would make lambda extremely small; you let m do most of the job, you fix it, and then lambda just saves you from the bad cases. And you can prove that this kind of works. So, in some sense, lambda and m can be somewhat interchanged in this story; you just have to figure out how they relate to each other. And the moment you try to make all this quantitative, you can actually prove it: they're basically inversely proportional to each other, and then you're good to go.

So the intuition right now is that when you move from PCA to random projections, you pay some extra variability, and you end up with two parameters that are very much related and somewhat work together to obtain the same effect, which is regularizing. But at the same time, m plays the role of fixing the amount of memory you need. So we somewhat start to achieve what we hoped for: we have an algorithm with an explicit expression, it's possible to relate m and lambda, and all of a sudden everything depends on the same quantity. m controls at the same time the statistics, the memory requirement, and also the time requirement, because the projection was basically for free and now the computation depends just on the reduced dimension. In the case of gradient descent it was very easy to write down the explicit correspondence; here it's a bit more complicated, because one would have to write down the relation between the eigenvalues of this matrix and the eigenvalues of the original matrix, and that requires a bit more work than we want to put in. So the take-home message of the story up to now is: PCA is nice, but the projection costs too much; if I replace PCA with a random projection, on average I'm doing the same thing, but I have this extra variability, and I have to be careful to pay for it in a nice way.
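As a minimal sketch of the second story, here is the "random projection plus a small ridge" estimator: m does the bulk of the regularization and carries the memory savings, while a small lambda guards against the unlucky directions (the data, the sizes, and the value of lambda are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 2000, 500, 50, 1e-3
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

S = rng.standard_normal((d, m)) / np.sqrt(m)            # random Gaussian sketch
X_m = X @ S                                             # n x m: only this ever needs to be stored
w_m = np.linalg.solve(X_m.T @ X_m + lam * np.eye(m),    # m x m system instead of d x d
                      X_m.T @ y)

def predict(x_new):
    return (x_new @ S) @ w_m                             # sketch the new point with the same S
```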
Yes? Yeah, yeah. So the cost I'm giving you here is somewhat the worst case. Every cost I wrote on the board in the last three days was the worst-case cost, and numerical people would jump out of their seats, because there are a bunch of ways to make things a bit better. For example, if you know the rank of your matrix, you don't need to diagonalize the whole thing and then go back. Or there are ways to do, essentially, stochastic gradient descent to compute the eigenvalues: you can write down an eigenvalue problem as an objective function, through what is called the Rayleigh quotient, which is the quadratic form of the matrix, and solve it by gradient descent or stochastic gradient descent. That would be a way to somewhat circumvent the memory problem in principle, but in some sense you still have to look at everything; it's more like what we discussed in the previous days, a way to decompose the computation into chunks, but you still have to look at all the data in the same way, do a first pre-processing where you kill the dimension, and then keep going. But definitely, you could do that; it would be a different way to compute eigenvalues, using SGD essentially.

What do you have to keep? This matrix S. At least in this form, you have to keep it, because when a new point comes in, it's the same S you want to apply again. That's the basic form. In practice, this idea is what is sometimes called sketching, and it's a very common idea in computational geometry and theoretical computer science. And oftentimes what you do is try to choose matrices that are very efficient to store and apply. For example, rather than using a Gaussian matrix, you can use something like a Bernoulli matrix, where you have a lot of sparsity and you can apply it fast. There is also a lot of literature on finding random matrices with special structure, like a block structure, that you can apply extremely fast. And this connects to a whole branch of numerical analysis where people do things like Galerkin methods or multi-resolution approaches, where you try to find matrices that are nice and structured in such a way that you can do coarse-to-fine computations, maybe use the fast Fourier transform, and so on and so forth. So behind your question there is the tip of an iceberg of a lot of literature on how to do fast random projections; if you look up fast Johnson-Lindenstrauss transforms, for example, you'll find a bunch of results. There is a really huge literature, so this is by no means the end of the story; it's just the beginning. In practice, you can make the storage and application of random matrices extremely efficient, and roughly, as long as you can ensure that something like this expectation property holds, you're good to go.
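One of the cheaper alternatives just mentioned, sketched in code: a random sign (Rademacher, Bernoulli-type) matrix, which is trivial to generate, can be stored in one bit per entry, and still satisfies the same identity-in-expectation property as the Gaussian sketch (the 1/sqrt(m) rescaling is the same convention as before):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 100, 20

def rademacher_sketch():
    return rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)   # entries are +/- 1/sqrt(m)

S = rademacher_sketch()

# Sanity check of the identity-in-expectation property: average S S^T over many draws.
avg = np.zeros((d, d))
for _ in range(200):
    T = rademacher_sketch()
    avg += T @ T.T / 200
print(np.abs(avg - np.eye(d)).max())   # small, and shrinking as the number of draws grows
```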
OK, so: PCA, fine; linear sketching, fine. Of course, you can also do linear sketching of nonlinear features, so you can still play that game. What's not clear is how you do this with kernels, because it looks like you really have to perform this computation, and you do: X_m is written like this, and nothing you can write gets around it. One thing you can do instead is to say: what if I do nonlinear sketching? What if I do something nonlinear with my sketch? Instead of just computing the matrix times S, I put some nonlinearity in front of it. An absolute value, for example, or a cosine, or a sigmoid. What do you get? Well, let's see. It's elementwise: sigma is an elementwise nonlinearity, applied to each entry of the vector. Does it look familiar? If you write down X_m with this nonlinearity, this is what you get. Look familiar? Somebody talked to you for five days, two weeks ago, about deep learning. So what's the difference? What do you do in deep learning? Well, this is the most basic neural net you can think of: no offset, one hidden layer, m units in the hidden layer. If you see it as a neural net, what do you usually do? You take this thing, you take some objective function of your choice, cross entropy, least squares, you name it, and then you do SGD to optimize what? These weights and these weights. But here, this is basically a strange neural network where you initialize the inner weights and you don't touch them; you just let them be. All you optimize are the top-layer weights. Clearly, this doesn't sound like a fantastic idea if you know how to handle the optimization of everything, but you can now see that it's just a special case of what we saw before. And it makes everything convex again, because the only free parameters you have are the w_j, and they enter linearly in this expression; they sit outside the nonlinearity, so we're good to go.

So we want to think a bit about this nonlinear sketching. What does it do? What are we actually doing? You can view it as a neural network, but you can also view it as yet another special case of what we discussed so far. And then you play the same game; sorry, there is a parenthesis missing there. You just solve a problem that depends on a parameter, and so on and so forth, the game you've seen so far. The cost is roughly the same, so you can only ask yourself modeling questions, because from a computational point of view everything is already taken care of. So what the hell are you doing? You're solving a problem where you have taken this one-hidden-layer neural network, you view it as sketching, nonlinear sketching, and so you fix the inner weights, you don't optimize over them: you initialize them randomly and you keep them.

Now, the observation that's been made over and over again is that there are cases where you can actually see what happens when you do the same kind of expectation calculation we did before. Remember, before I showed you that you can get back the inner product; well, you can ask the same question here. What is this? Is this an inner product somewhere? And it turns out there are a lot of situations where we know it is exactly an inner product, a special inner product that becomes a kernel. It's not an inner product in the original data space; it's an inner product after some mapping.
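Here is the nonlinear-sketching setup as a minimal code sketch: freeze random inner weights, apply an elementwise nonlinearity, and fit only the outer linear weights, which keeps the problem convex (the cosine nonlinearity, the squared loss, and the data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 2000, 20, 300, 1e-3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)        # some nonlinear target

S = rng.standard_normal((d, m))                            # random "hidden layer", never trained
Z = np.cos(X @ S)                                          # n x m nonlinear sketch sigma(X S)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)    # only the top layer is fit (convex)

def predict(x_new):
    return np.cos(x_new @ S) @ w
```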
Take the Gaussian kernel and write it using the Fourier transform and the anti-Fourier transform: write the identity by taking the Fourier transform and then the inverse Fourier transform of the Gaussian. I'm lazy, so I won't write the normalization constants. Here s is the frequency. The Fourier transform of a Gaussian is again a Gaussian, with the sign flipped in the exponent; the inverse transform brings in complex exponentials, and the translation gives you the difference x − x' — there is a prime missing on the board, by the way. Nothing special so far. Now look at this, and view it in a slightly different way. Put this Gaussian factor together with the measure: it's a Gaussian density, up to normalization. So you can view the whole thing as an inner product. It's the inner product of one function against another; they're complex, so you have to take the conjugate of one of them. And the measure is not the uniform measure; it's weighted by this density. The usual way to write an inner product between two functions f and g is the integral of f(x) g(x) dx; here we have a density p in the integral. In our case the integration variable is the frequency, so it's ds; the density p(s) is a Gaussian in s, with its width set by the kernel width; one function is s mapped to e^{isx}, and the other is s mapped to e^{isx'}. Is that fine? So that's observation one. Now I can call this phi_s(x). For a fixed s it's just a number — a complex number, but just a number. Up to now we wrote phi(x) = (phi_1(x), phi_2(x), ...), and this vector lived in R^p or in little l^2, the space of square-summable sequences. Here we're doing the continuous version of that: phi(x, s), where s — let me put parentheses just to emphasize it — is now the free variable, so phi(x, ·) is a function, an element of the L^2 space of complex-valued functions of s, square-integrable with respect to this Gaussian measure. So all I'm saying is: this is an inner product, but instead of feature maps that are infinite sequences, the feature maps are functions in an L^2 space. That's fine, and it's not the main point of the story. If you stop here, you can complain. You can say: OK, but you told me the Gaussian kernel can be written as an infinite series, and that was already annoying; now you use the Fourier transform to tell me it can be written as an integral, and the feature maps become functions. That doesn't sound like progress. Well, the reason is that now you can look at this as an expectation: the kernel is the expectation of phi_s(x) times the conjugate of phi_s(x') when s is drawn from this density. So you don't just have an inner product; you can interpret it as an expectation over an appropriate distribution. And now, if you want to approximate an expectation, how do you do it? You do a Monte Carlo approximation. You sample the distribution, and you approximate the kernel by averaging over M sampled frequencies.
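Here is a tiny numerical sanity check of that identity, in the scalar case and with the parametrization where the frequency density is a standard normal (so the kernel width is one — that normalization is my assumption for the example, not something fixed on the board): the Monte Carlo average of the complex exponentials should match the Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scalar check of  E_{s ~ N(0,1)}[ e^{i s x} * conj(e^{i s x'}) ] = exp(-(x - x')^2 / 2),
# i.e. the Gaussian kernel written as an expectation over random frequencies.
x, x_prime = 0.7, -0.4
s = rng.standard_normal(200_000)             # sampled frequencies

phi_x  = np.exp(1j * s * x)
phi_xp = np.exp(1j * s * x_prime)

mc_estimate = np.mean(phi_x * np.conj(phi_xp)).real
exact = np.exp(-(x - x_prime) ** 2 / 2)

print(mc_estimate, exact)                     # the two should be close
```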
Each sample is a random frequency. You take a frequency with Gaussian probability: you stay around 0, so frequencies close to 0 are more likely than frequencies far away from 0. Pick a random frequency, fix it, and use it to build one nonlinear sketch. Look at the nonlinear sketch: it connects back to the previous slides. A nonlinear sketch was: take a random vector, multiply, and then apply a nonlinearity. That's exactly what we're doing here. We take a random frequency, we multiply, and then we apply a particular nonlinearity, a complex exponential, and renormalize. That's it. So again, let's see what we've done. We started by saying: let's do nonlinear sketching. You can invent whichever nonlinear sketching you want, and I'm inventing one particular one because it gives us back a nice observation: I take random Gaussian vectors, and then I apply, rather than a sigmoid, a complex exponential. And by the way, I don't do it here, but if you play around a bit, using the symmetry and the fact that the kernel is real-valued, you can replace the complex exponential with a cosine. Then it looks a bit less scary — it just becomes a cosine plus a random uniform offset — but that computation doesn't need any further explanation. All right, so I build that, and then for free I get an observation. If, rather than taking just M sketches, I let M go to infinity, I get back the Gaussian kernel. So in some sense this tells me that when you take a random sketch and make it larger and larger, you are essentially approximating what you would do by just using the Gaussian kernel. That's what's written on the board right now. While this is not mathematically complicated, it connects a bunch of things that usually you don't put in the same place, so you need a little time to digest it. The calculation is just this one; it's the only piece of math. You just have to stare at it and say: inner product, expectation, kernel, neural networks — and draw conclusions. So if you view it from a neural network perspective, it tells you the following: take a shallow, one-hidden-layer network, let the hidden layer have random weights drawn from a Gaussian distribution, and let it be infinitely large. So it's shallow, but infinitely wide. Then it's just using the Gaussian kernel. So an infinitely large one-hidden-layer network will never be better than just using the Gaussian kernel, in the sense that, in the worst case, they span the same function class. This doesn't mean that one cannot explore a different subset of the functions; it might be that by choosing the size of the layer you constrain yourself to look in a nice region. But they have the same power, so to say; one is not larger than the other. So that's observation one. And this observation is old. I think Radford Neal was one of the first people to notice it, sometime around the beginning of the 90s. He didn't talk about kernels.
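To see the "M goes to infinity gives back the Gaussian kernel" claim numerically, here is a small sketch using the cosine-plus-uniform-offset form I just mentioned; again the sizes and the unit kernel width are my own choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))

def gaussian_kernel(X):
    # k(x, x') = exp(-||x - x'||^2 / 2), unit width for simplicity
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2)

def random_fourier_features(X, M, rng):
    # Cosine form of the nonlinear sketch: random Gaussian frequencies
    # plus a uniform offset, scaled so the inner product matches the kernel.
    W = rng.standard_normal((X.shape[1], M))
    b = rng.uniform(0, 2 * np.pi, size=M)
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

K = gaussian_kernel(X)
for M in [10, 100, 1000, 10000]:
    Z = random_fourier_features(X, M, rng)
    err = np.max(np.abs(Z @ Z.T - K))
    print(M, err)   # the error should shrink as M grows
```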
He talked about Gaussian processes, but that's the same thing. He connected neural networks with Gaussian processes, and he said: by the way, an infinitely large one-hidden-layer neural network is just a Gaussian process. This was yet another reason why people said, wait a minute, why don't I just go shallow, if that's already so powerful? So that's something to keep in mind: if you take a very large one-hidden-layer network, you're doing nothing better than taking a Gaussian kernel and forgetting about depth. This doesn't mean that depth is not good. But I think the main point of depth is not that by just stacking stuff together you get more power; I think that's wrong. What is true is that depth gives you the freedom to incorporate structure. I have very little belief, personally, that fully connected deep networks will give you a real edge, aside from implementation aspects, over any of the simple stuff we've seen in the last three days. I'm pretty confident that if you do things like convolutional neural networks, or any other constrained architecture where you take the opportunity to insert prior information within the layers — to constrain them, with weight sharing, in very specific ways — that gives you a lot. Because you're not just exploring the big space; you're biasing yourself towards structure that might be interesting. And for images and speech signals, that's clearly the case. I'm going to show you, if I have time, some experiments that essentially suggest that if you use fully connected networks, you can just as well use one of the methods we've seen: they work roughly the same way. Whereas if you try to match convolutional neural networks, we really have no idea what they're doing and how to reproduce it in a different way. So again, I'm not trying to say something particularly strong against or in favor of deep networks. I'm just saying that this shows that one hidden layer is already very powerful; you can do multiple layers, but it's not completely clear what that by itself is going to give you. Whereas if you start to put constraints in the layers, as you do in a convolutional network, that's a completely different business: instead of the whole big set of continuous functions, you go and describe some specific subsets. And if that's the right choice, that can be a very good idea. All right, so that's my five minutes of blabbering about neural networks. The last thing is that you can take a kernel perspective on all this. You can say: wait, OK, so if I take an infinite layer I'm using the Gaussian kernel, but what am I doing in practice? You're replacing the Gaussian kernel with a finite-dimensional approximation. You could have truncated the series, as I told you the other day; here you do it by taking a random approximation of the Gaussian kernel. So if you want to use the Gaussian kernel but the kernel matrix is too big — it's a million by a million, you cannot store it — what do you do? You take this approximation of the Gaussian kernel, you do the Monte Carlo approximation, and you're left with a finite-dimensional feature map of dimension M, which in expectation gives you back the kernel. This is what is called random features.
So I called it nonlinear sketching, but if you look up random features, you'll find them presented either as nonlinear sketching or as a way to approximate kernels and scale them up. There are a lot of ideas here, so let them sink in a bit, but that's basically the story: PCA is a nice principle but too costly; linear random sketching does roughly the same thing on average, with extra variability that we have to cure; then we pushed the idea of linear sketching to nonlinear sketching, and discovered a connection with neural networks, kernel methods, and all of the above. So, a couple of comments. As I told you, in this specific case you can work it out a bit, and instead of the complex exponential you can consider the cosine; it's equivalent. You just have to take b_j uniform between 0 and 2 pi, if I remember correctly. These are what are called random Fourier features, as a particular case. But you can also play the same game for different kernels. You can take a kernel and try to see if you can write it in this form, or you can take a nonlinearity, integrate out the random weights, and see what kernel you get. People have done this for a bunch of choices: take the ReLU and integrate out and see what you get, or take the sigmoid and do the same. You get certain kinds of kernels, and this gives you an indication of the largest space you can explore with an infinitely wide network built on that nonlinearity. So, real quick — no, actually, real skip: I'll just tell you what's in this slide, and I'm skipping it. The idea is: what if I use the data themselves to project the data? I consider a subset of the data points, and I use them to project the data. So I don't throw away the data; I just multiply the million points by m of them, and I use that as the sketching. That's one thing you can do (there's a small sketch of this idea right after the recap below). I want to skip it because I want to show you some results. The bottom line is that you can do this in a couple of different ways, and it works. And it has a connection with the so-called Nyström approximation of integrals. If you look in the literature, this specific idea of using a subset of points to summarize the rest is related to what are called quadrature methods: you have an integral, and you replace it with a Monte Carlo approximation. You can play the same game if you have a big sum and you want to replace it with a smaller sum. And it turns out that when you use the data themselves as the sketching, that's what you're doing. In the slides I give you a bit of a hint of that, and, as I said, there's more stuff than I can present, so if you want to catch up, we can do it. OK, so here is a bit of a summary. Projection regularizes: you can use projection as regularization. With random projections you have to be careful about the extra variability. So you can reduce computation and regularize at the same time by doing sketching, and you can do this in different kinds of ways. On the one hand this stuff has connections with classical projection methods — for those of you who come from numerics, things like Galerkin methods and quadrature methods; there is a whole literature related to that.
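To make the "use the data to project the data" idea concrete, here is a minimal Nyström-style sketch under my own toy assumptions (unit-width Gaussian kernel, landmarks chosen uniformly at random): m landmark points give a finite feature map whose inner products approximate the full kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 2, 100
X = rng.standard_normal((n, d))

def gaussian_kernel(A, B):
    # k(x, x') = exp(-||x - x'||^2 / 2)
    sqA = np.sum(A**2, axis=1)[:, None]
    sqB = np.sum(B**2, axis=1)[None, :]
    return np.exp(-(sqA + sqB - 2 * A @ B.T) / 2)

# "Use the data to project the data": pick m landmark points at random.
idx = rng.choice(n, size=m, replace=False)
landmarks = X[idx]

K_nm = gaussian_kernel(X, landmarks)           # n x m
K_mm = gaussian_kernel(landmarks, landmarks)   # m x m

# Nystrom-style feature map: Z @ Z.T approximates the full n x n kernel matrix.
eigval, eigvec = np.linalg.eigh(K_mm)
eigval = np.maximum(eigval, 1e-12)             # guard against round-off
Z = K_nm @ eigvec @ np.diag(eigval ** -0.5)    # n x m features

K_full = gaussian_kernel(X, X)
print("max abs error:", np.max(np.abs(Z @ Z.T - K_full)))   # should be small
```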
You can find these old ideas in theoretical computer science, where people talk about streaming and sketching: the data are too many, they keep arriving, and I just want to compress them. And in the last few years there has been a lot of literature on what is called randomized numerical linear algebra. Essentially, you have to solve a big linear system but you don't have the capacity to do it, so you squeeze it at random, solve the squeezed version, and ask yourself how far you are from the true solution. Again, here the idea is that you don't just want to solve that system; you want to predict data in the future. So if you can solve the data-driven system to a given accuracy, that might be good enough, and you lose nothing at test time. All right, so this is the shameless plug. If you want to be in touch, there is going to be a summer school organized in June, so get in touch. I also have a bunch of postdoc positions, so if you like what you've seen so far, drop me an email. But now let's go to the ghost track of this talk, because I thought I'd end by showing you where all this story can go. To be fair, to me most of this is a playground: I want to understand some phenomenon that I see in machine learning, and rather than going nuts with fifteen layers of some super-tangled network, I take linear functions and say, OK, let's first see what I can understand. Then, if something happens, I know that at least I have some hope. So a lot of this, to me, is more like a proof of concept that certain things exist: optimization regularizes, projection regularizes, randomness relates to that. And I'm aware of all the limits of this once you try to play the same game with other stuff. But it's also refreshing, because you get back to safe ground — and to a benchmark ground, because you can now take this and say, OK, let's see how it works. Because, yet again, this might be simple, but I'm still working with computationally efficient algorithms: I can scale a lot, I can use kernels and infinite-dimensional function classes, and see how far they get me. And I can say: well, I have this fantastic GPU that I bought, I have PyTorch, and I can run a deep network for a week. But then what the hell do you do for a week? You might as well do this in an hour and see how it works. And that's what we typically do. So I just want to show you some results. Before I show you the crazy table, I'll show you results from some recent work in my lab where we combine everything. So far I told you about penalization, optimization, and projection, but we had not combined them; here we combine everything. We do a little bit of regularization by penalization, we do optimization, we do projection, and we try, to some extent, to get theorems showing that we can get the optimal result without losing anything, while using close to the minimal amount of computation and memory. Essentially, at this point, we're able to solve nonlinear problems with kernels at the scale of a million or ten million points. So we went around and checked all the datasets we could find that are publicly available with more than a million points — because below that, you can do a lot of stuff anyway. And it turns out that, funnily enough, in the era of big data most of them are private; you cannot get your hands on them. But we did find a bunch, and here I just want to show you some of the results we got.
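Before the results, one small aside to make the sketch-and-solve idea from randomized numerical linear algebra concrete. This is just my own toy illustration with made-up sizes, not part of the benchmark story: compress the rows of a big least-squares problem with a random matrix, solve the small problem, and compare to the exact solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 5000, 50, 500      # big overdetermined system, sketched down to m rows

A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact least-squares solution (what we would like to avoid computing at scale).
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)

# Sketch-and-solve: squeeze the system at random and solve the squeezed version.
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# Small but nonzero: sketching trades a bit of accuracy for a much smaller problem.
print(np.linalg.norm(x_exact - x_sketch) / np.linalg.norm(x_exact))
```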
So, for no particular reason, we called our algorithm Falkon. And these are all the competitors. We did not rerun their experiments; we just take the numbers they report. Essentially, the competitors here include trees, other kernel methods, and sometimes deep networks, so it's quite a large set. The first two columns are different measures of accuracy, and the third one is time. If you check the accuracy columns, there's not much going on: we're more or less a little better here, a little worse there. I wouldn't bet my money on selling this for that reason. What I find very interesting is the time. For example, in this case we take 55 seconds, and this is the competitor's number — and you see all these extra symbols: that's the hardware they use, say Amazon cloud services or a cluster. All of them are running on distributed computation, whereas ours runs on one desktop computer with one GPU. So you go from a long wait to a turnaround time of about a minute to get a result which is state of the art. Here the story is the same: you go from 500 minutes on a distributed architecture to 20 minutes. You go from hours on a distributed architecture to about an hour and a half on a single machine. And here is the first dataset where you see deep networks appearing. This was one of the first cases, in speech, where it started to become clear — where there is some evidence, some clue — that fully connected neural networks can be matched by a clever shallow architecture. Then we bumped into the high-energy physics data, essentially because these people are drowning in data: they produce terabytes of it every other minute. And here the story is the same: you can go from hours to a couple of minutes and achieve more or less state-of-the-art results. There is one case where there is a gap, and it's unclear why: they use six times more parameters than we do, because at some point we hit the limits of my poor GPU. So we need to use more machines, and we're doing that now. Again, I'm not trying to sell you anything. I just want to show you that, to some extent, all the tricks we've seen so far are not just for the sake of understanding — which is still my main motivation — but can also be put to good use. Because you might have to wait a week for your PyTorch job to finish, and in the meantime you can check whether something much simpler works. In all these cases it does: it does comparably well, or better, and in much less time. And it's somewhat simpler to use: the algorithms are literally the ones you saw — projection, penalization, optimization, and that's it. Now, just to tell you that the story of this comparison is not so simple, this is ImageNet — which, if you don't know it, means you have not seen a machine learning talk in the last five years. It's the dataset where, one day, one guy told another guy, "what should I do to convince you that deep learning really works?", and the answer was "do ImageNet" — and he did, and here we are; it's five years that we talk about deep learning. So this is the dataset, and what you see here is the percentage error of different kinds of architectures. And what we did here is to try to throw at it what we now know how to train.
We now have essentially an SVM-like thing that we can train on ImageNet, which is one to ten million points depending on the version you take, and you can see how it works. And it sucks — it's really bad. If you just start from raw pixels, put a Gaussian kernel on top, and try to predict, it sucks. And that's kind of reassuring. Because if I could take an image, scramble all the pixels — a Gaussian kernel doesn't care about the position of each pixel, only about the vector of values — and still get state-of-the-art results, it would mean that object classification is a really stupid problem. It's not; you have to be smart. People have been doing filtering, banks of filters, coding of filter responses, and then supervised learning on top for something like 30 years to get good results, and, if you want, the networks are now doing an optimized version of all of that. And that matters: you have to look locally, then a bit more globally, then put in some invariance. I don't know exactly what a convolutional network is doing, but it's clear that it treats an image as more than a set of numbers you can reshuffle. If you ignore that, it doesn't work. And how to do it in a non-end-to-end, shallow way is very unclear to me; we've been trying, but we haven't managed so far. What you can do, though, is say: all right, maybe I should think about what the network computes before the last few fully connected layers. So you cut off the last few fully connected layers, use the features trained on ImageNet up to that point, and see how it goes. And it goes well — really well. You can just chop it off, wait an hour to train the part on top, and you get good results. Then you can object: you still have to produce those features. And what do you do on a new dataset — do you really retrain your Inception- or VGG-style network on every dataset? You can try, but you typically don't really want to, because if you have to retrain, these things often need a lot of data to be trained well. But then if you go on Twitter, the source of all information, you find out that people say you don't really need to retrain; you can do transfer learning. What does that mean? Well, it means that you take a new dataset, you don't retrain the whole architecture, and you only retrain the last few layers, the top of the network. If you give me a new dataset, I'm not going to retrain the whole thing, because that might be very hard; but if I more or less keep the previous architecture and just retrain the last few layers, that's much easier, and it works really, really well. That's why deep learning is so often reused across different domains and datasets. OK — but that's a game I can play too. I keep the network up to some layer, take those features, and feed them to the methods we've seen; let's see how it goes. We work a lot with roboticists, and we've been playing exactly this game; it has become a genuinely useful application, so let me show you a bit of what we did. This was a detection problem, so it's actually a fairly complicated architecture, where a neural network selects different regions of the image and essentially computes the features. And we played the same game: at some point we chopped off the head of the whole thing, kept the rest as it was, and just redid our part on top.
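A minimal sketch of this chop-and-retrain game, under my own stand-in assumptions: `extract_features` is a hypothetical placeholder for a frozen, pretrained backbone with its head chopped off (here it's just a fixed random map so the script runs end to end), and the part that gets trained is the usual convex regularized least squares on top.

```python
import numpy as np

def extract_features(images):
    # Hypothetical stand-in for a frozen, pretrained backbone with the
    # last fully connected layers removed; never updated during training.
    rng = np.random.default_rng(42)
    W = rng.standard_normal((images.shape[1], 256))
    return np.maximum(images @ W, 0.0)

rng = np.random.default_rng(0)
images = rng.standard_normal((1000, 3072))        # toy stand-in for image vectors
labels = rng.integers(0, 2, size=1000) * 2 - 1    # toy +/-1 labels

# Transfer-learning-style pipeline: frozen features plus a simple convex top layer.
Z = extract_features(images)
lam = 1e-1
w = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ labels)

predictions = np.sign(Z @ w)
print("train accuracy:", np.mean(predictions == labels))
```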
And the long story short is that if you use the current state-of-the-art architectures for this — they're based on deep learning, with some tuning of the last few layers — you get very good results in something like 25 minutes. If you use this kind of hybrid architecture, where you chop off the head and put one of these simple shallow methods on top, you get roughly the same accuracy in eight minutes. And if you're willing to lose 4%, you can go down to 25 seconds. So you go from 25 minutes to 25 seconds, losing that much. Now, this is a benchmark dataset. On our own datasets, what you see is that you go from 40 minutes and 49.7% to 50 seconds and actually slightly better results. Again, I don't put much weight on the "better"; I'm not claiming that part is in our favor. The differences start to be small enough that it's not clear. But I really like going from that time to this time, and the roboticists like it too, because they go from offline to real time. And again, I'm not trying to say that this is something you can use all the time. I'm just giving a warning that sometimes, perhaps, when you want to use machine learning, rather than immediately using the most complicated deep architecture and reading the blog post that tells you how to choose the learning rate of the seventh layer, you can try something simpler that scales. If you have a million points, you need these kinds of recent packages; if you have 1,000 points, you can just use whatever. And I think it's fair to say that in a lot of cases you can get this kind of result, because we've been doing it for two years. All right, so that's basically what I wanted to say. I thought I'd finish on this note just to give you a bit of the context of what you've seen. It's fair to say that the main goal of what I showed you today is to give you a way to think about building learning algorithms. But I also wanted to give you the feeling that we didn't just look at pure simplified toy theory: you can actually get algorithms that you can train on 10 or 15 million points and get some pretty decent results. That was the ghost slide. All the slides should be more or less online — I sent them yesterday to Eric — and the videos are online. I have to rush, basically, as soon as the coffee break starts. But if you want to stay in touch, or you have a question or anything, just drop me an email. So I'm all set, and I'll take questions.