So, let's continue. What else did I want to say? Oh, yeah. I don't see people talking to each other, and I'm bored in the evening. So: 8 o'clock in the lobby. If you want to come have a beer with me, I'll buy you a beer. All right? I don't know if I'm allowed to do that, so if I'm not allowed to do that, I didn't say it. But I'll still be in the lobby at 8 o'clock for mysterious reasons.

So, let's start. Last time we started talking about machine learning, and what we wanted to emphasize was the difference between predicting and fitting. I hope, playing with the Python exercises, you saw that there's a subtlety in all this: we want to predict on data we haven't seen yet, but we only get to fit on the data we've already seen. That's going to come up again and again in what we do, and a lot of what you need to understand about machine learning right now is how we handle that. The plan for the lecture is that I'm going to do it all on the blackboard; we'll go back to Python notebooks tomorrow morning. But to get you there, we have to understand the basic ingredients of this kind of machine learning.

There are basically four parts to a machine learning model. First, there is the optimizer: remember, we have to minimize the training error in some way — really we'd like to minimize the generalization error — and part of that is choosing parameters, which we do by optimizing. We'll talk about what that means. Second, there is the cost function, which is how we measure how well we're doing. Third, there is regularization, which is a little more subtle. Regularization is basically the idea that we only get to see training data, so what we actually minimize is the error on the training data, but what we want is to generalize well — to minimize the error on data we haven't seen yet. Remember from last time that you can get overfitting, and to take care of the difference between what I actually optimize and what I would like to optimize, you have to do something called regularization. And the fourth component is how we choose the model itself.

Modern deep learning is basically built around neural networks. Geoff Hinton wanted to rebrand neural networks because they were extremely unpopular, so he started calling them deep learning in the mid-2000s. As hard as it is to believe now, there were like five people left working on neural networks in the mid-2000s, and everyone thought they should stop working on them and do support vector machines instead — until 2012, when there was this amazing result called AlexNet. It basically had to do with the fact that they put these neural networks on GPUs. If I have some time, I'll tell you about GPUs and computation, but we're not going to get into that right now.

So let's go through all this together. As I said before, we have these two functions we care about: E_train, which is measured by some cost function — wow, I'm making a lot of noise with this chalk — and some other thing you care about, which we're going to call E_test, though it's not really the test error; it's the error on a typical piece of data I'd see otherwise. And I have some parameters that I have to choose. And again, I want to emphasize the subtlety.
I keep writing these things that are completely redundant, over and over, because what I'm basically saying is: I have access to this, but I care about that. But we know that if the training error is terrible, the test error is also going to be terrible. Doing well on training is necessary — you have to be at least pretty good at it, otherwise you're never going to be able to predict on anything new. So the question is: how do we optimize and choose these parameters?

The trick of modern neural networks — what's really important, I would argue — is that two basic things happened in the last 15 years that are completely surprising. One: if you read the old neural network literature from when it was really unpopular, say 2009 or 2010, back when I started as a faculty member, what they kept telling you is that we just didn't have big enough computers or big enough data sets. And that actually turned out to be true. So the first thing that has made this easy is GPUs and the size of the data sets — this is computation. But the second thing, which is kind of surprising and which I would argue no one really understands yet, is the surprising power of differentiability, especially in high dimensions. By that I mean: you would think it would be really hard to learn a map from a million-dimensional space to another million-dimensional space. But if you regularize, and if you use the fact that the thing you're trying to learn is essentially differentiable, it turns out you can do this quite easily. This kind of differentiable architecture is key to understanding what's going on.

So the optimization techniques I'll use really assume that I can take derivatives of my models with respect to the parameters easily. This is the key to all of it — I would say the real key, and we don't really understand it: the idea that we want to optimize by taking derivatives. For that reason I'm going to focus on optimizers that are based on taking derivatives. What we really want to be able to do is take derivatives and then use the chain rule — and derivatives plus the chain rule is backpropagation. That's the famous algorithm that was discovered in the '80s, but no one back then really appreciated how powerful differentiability was. And I would say we still don't really understand why, once you make things differentiable, you can learn these maps between super-high-dimensional spaces — in a way that, if you'd made me guess even a decade ago, I would have found really surprising. Everyone would have. I think Geoff and Yann were the only people who actually believed this was possible.

So let's begin. Generally we're going to be interested in some cost function or error function — instead of C, I'm going to use E for all this stuff. The important point is that these error functions generally have a very special form: they are usually sums of individual errors, one for each training data point. Think about least-squares regression: there the error is the sum over data points of (y_i minus my prediction)², and the important point is that it's a sum over individual training points — the error on each data point, summed up. Is everyone okay with what's going on? This is going to play an important role in what follows in a second. And now I just want to minimize this error function.
So right now let's think of this as an optimization problem. How do I optimize this with respect to some parameters θ? What I want to do is iteratively update my parameters: given parameters θ_{t−1}, I want to update them to θ_t. And the basic idea is that I want to do this with gradient descent — with differentiable things, because differentiability seems to be the key to all this. So how do I do this? Well, it's pretty easy. One thing I can do — and I'm going to try to establish some notation — apparently that eraser is not going to cooperate; I'm going to keep hitting this board over and over unless I move this over here. Oh, it sticks on. Oh, how clever, there's Velcro. Oh, then you can wash them? Wow, Korean technology.

One thing I can do is just calculate the gradient and update:

θ_t = θ_{t−1} − η_t ∇_θ E.

So I have some function I'm trying to minimize, I take the gradient, and I just iterate this update equation. And I've introduced a very important parameter that's going to play a big role in all this: this η_t, which is called the learning rate. This is the dumbest thing I could do: take my energy function, take a gradient, and go in the direction that decreases it, iteratively.

So what does the learning rate mean? Why do I need this parameter? Why do you think? Yeah — generically (I didn't have to do that; I'm confused about the boards), I have a complicated function, and the gradient only gives me local information. So I only want to make local moves in my parameter space. The learning rate controls how much I move — how seriously I take the gradient. Generically these landscapes are complicated, and they're in high dimensions.

Now, if I'm doing ordinary optimization, generally I don't do gradient descent; I do something called Newton's method. How many of you have heard of Newton's method? Raise your hand. How many people know Newton's method? Some of you. Okay. So what does Newton's method do? How is it different from this? Here I just use the first derivative. What does Newton's method use? The what? This is also a linearization, but it's just the first derivative, right? What I'm writing here is the expansion

E(θ + δθ) ≈ E(θ) + δθ · ∇_θ E,

and what would be the next term in this expansion? Yeah — you'd have the Hessian term, ½ δθᵀ H δθ, which just measures the curvature. So Newton's method says: don't do the simple update above. Instead, use

v_t = H⁻¹ ∇_θ E,   θ_t = θ_{t−1} − v_t,

where H is the Hessian of E. Did I mess up some signs? Oh, I don't know why I had a minus sign there — okay. So I do this update, and instead of a learning rate, I've used the Hessian.

What are the advantages of Newton's method? Why, if I'm just doing optimization and it's easy to calculate the Hessian, would I use it? What's the intuition? Sorry? It's defining the variance — is that what you said? Yeah, in some sense it measures the local curvature, right?
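As an aside, here's a minimal NumPy sketch of the two updates so far — plain gradient descent and the Newton step — on a toy quadratic. The matrix A, the learning rate, and the starting point are all illustrative, not from the lecture:

```python
import numpy as np

# A toy 2D quadratic "energy" E(theta) = 1/2 theta^T A theta, so we can
# write its gradient and Hessian in closed form.
A = np.array([[3.0, 0.0],
              [0.0, 0.3]])   # very different curvatures in the two directions

def grad_E(theta):
    return A @ theta          # gradient of the quadratic

def hess_E(theta):
    return A                  # Hessian is constant for a quadratic

# Plain gradient descent: theta_t = theta_{t-1} - eta * grad E
def gradient_descent_step(theta, eta=0.1):
    return theta - eta * grad_E(theta)

# Newton's method: theta_t = theta_{t-1} - H^{-1} grad E
# (for a quadratic this jumps to the minimum in a single step)
def newton_step(theta):
    return theta - np.linalg.solve(hess_E(theta), grad_E(theta))

theta = np.array([1.0, 1.0])
print("one GD step:    ", gradient_descent_step(theta))
print("one Newton step:", newton_step(theta))   # lands at [0, 0]
```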
And one of the things it does is that it treats different directions in different ways. The problem in high dimensions is that here, with a single learning rate, I take steps of exactly the same size in all directions. But generically in these high-dimensional landscapes you have some directions that are narrow and other directions that are really wide. So you can think of the Hessian as adapting the step size: the learning rate sets one global step size, while Newton's method says, no, I'm going to adapt how far I walk in each direction based on second-derivative information. We'd love to do that.

In particular, consider the one-dimensional problem. Let me see what I'll do — yeah, I'll come back here and erase this. If I have a quadratic potential — and a quadratic is what every local minimum looks like, locally — then the optimal learning rate is just the inverse of the second derivative, η_opt = 1/E''. And here's what happens in the different regimes:

- If η < η_opt, you take little steps and slowly reach the minimum. (Everyone sees that η_opt = 1/E'' is just the one-dimensional version of the Newton's method up there — that's why it's the optimal choice.)
- If η = η_opt, you reach the minimum in a single step.
- If η_opt < η < 2η_opt — I need two more colors; maybe I have them — you can convince yourself that you bounce back and forth across the minimum, but you still converge.
- Finally, if η > 2η_opt, the learning rate is too big and you actually start diverging. You never converge.

So you become really, really sensitive to this learning rate. That's something you'll see over and over again. This is our first example of what's called a hyperparameter: a parameter that matters a lot whenever you're solving a hard machine learning problem, and that you have to sweep over.

And what's interesting is that this was one dimension, but I can think about the same picture in high dimensions. If I go back to this picture — maybe here; I want to show both the picture and the equation at the same time, but I'll just write it again — look at the Newton's-method update. You can always diagonalize the Hessian; it's a symmetric matrix. The eigenvectors then give you independent directions — everyone okay with that intuition? You diagonalize, and in the diagonal basis each direction is just an independent one-dimensional problem like the one above. But now you get into a big problem: generically you're limited by the stiffest direction. You have to choose your learning rate smaller than 2 over the largest eigenvalue of the Hessian; otherwise you're going to diverge in some directions while converging in others.
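To make the four learning-rate regimes concrete, here's a tiny NumPy demo on a 1D quadratic. The constant k, the starting point, and the step count are made up for illustration:

```python
import numpy as np

# 1D quadratic E(theta) = k/2 * theta^2, so E'' = k and eta_opt = 1/k.
k = 2.0
eta_opt = 1.0 / k

def run_gd(eta, theta0=1.0, steps=10):
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * k * theta   # gradient descent on the quadratic
    return theta

for label, eta in [("eta < eta_opt         ", 0.5 * eta_opt),
                   ("eta = eta_opt         ", eta_opt),
                   ("eta_opt < eta < 2*opt ", 1.5 * eta_opt),
                   ("eta > 2 eta_opt       ", 2.5 * eta_opt)]:
    # small steps converge slowly; eta_opt converges in one step;
    # between eta_opt and 2*eta_opt you oscillate but converge;
    # beyond 2*eta_opt you diverge.
    print(label, "-> theta after 10 steps:", run_gd(eta))
```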
So that's why you're so sensitive to the learning rate: it basically sets a resolution, and in high-dimensional spaces you're going to have a whole distribution of eigenvalues. You could always just choose the learning rate smaller, but then convergence becomes computationally more expensive — that's why we don't choose it really small. So there are all of these trade-offs; that's why you're so sensitive to these hyperparameters. All right. So I would love to use Newton's method, but I can't.

So let's think about what's going on. This is gradient descent, and what we're going to see is that gradient descent itself is not what anyone actually uses, because it has some fundamental limitations — some of which we can overcome with other tricks, and some of which we can't. So let's think about why we can't just use gradient descent or Newton's method to optimize stuff.

Start with Newton's method. The first thing you should ask, if you're remotely computational, is: why don't I use Newton's method when I have millions of parameters? What's expensive? I consider myself a pure theorist and even I know what's expensive. What's expensive in this method, computationally? Yeah — the inverse. At each step I have to invert something like a million-by-million matrix. It's actually much bigger than that; it's not even computationally feasible. Even a few hundred thousand parameters, which is a small neural network, means inverting a few-hundred-thousand-by-few-hundred-thousand matrix at every step. Too expensive. So this kind of Newton's method is out from the beginning: we can't have big matrix inversions. But we would still like to adapt the learning rate — that's the intuition to take from all this.

On the other hand, think about gradient descent, and let's understand its limitations. Gradient descent seems great, but then you start — this board is going to be the bane of my existence — all right. Gradient descent. Where are its limitations? To understand why people use the optimizers they do, it's good to understand what the limitations of gradient descent are. I've said most of them already in this talk, so who wants to tell me some, based on what we've been discussing?

Yeah — even without the matrix inverse, it's computationally intensive. Why? Look at the error function: it scales like the number of training data points. At every step you have to do a big sum over a huge data set. So the first limitation, which we didn't really discuss, is that it's quite computationally expensive: each step requires a sum over the whole training set, and as the training set gets big, that gets really expensive.

What's another limitation of gradient descent? Yeah — another thing we didn't talk about, but it's certainly true, is that you can get trapped in local minima rather than global minima. For a long time people have worried about this: I erased my very complicated landscape, but in general you have a very complicated function, I need to make the learning rate small, and if the learning rate is small, then I get trapped in local minima.
It turns out — and this is one of the interesting things about deep learning, and why modern statistics fails here — that most of the local minima that are reasonable are actually pretty good. But you do get trapped in local minima. Okay? What else? How about high dimensions — what was the point we made about that? What is one of the things Newton's method in principle fixes? What did I say about different directions, and how gradient descent treats them? Yeah — it treats them all the same. There's no reason to assume the landscape is that uniform, and a lot of the tricks we'll see are about fixing exactly this. So: gradient descent treats all directions the same, which is annoying. And the last limitation, as that picture shows, is that it's very sensitive to the learning rate.

In practice, it's true that if you have an infinite amount of time, the best thing really is playing with the learning rate and doing — well, not gradient descent, but stochastic gradient descent, which we'll get to in a second. But in practice what you want is stuff that works off the shelf. Nobody trains something with plain gradient descent or SGD on the first try, before they have some reasonable tuning of all the other hyperparameters — but we'll get to that. So here's the goal: fix as many of these problems as I can. The local-minima problem is never going to be fully fixed by any local, derivative-based method, so we can set that one aside, and then we try to fix all the others.

The first thing — which at first looks like just a trick for fixing the computational problem, but turns out to actually work much better than gradient descent in practice — is something called stochastic gradient descent. This is the workhorse: neither of the algorithms on the board is ever used as is. What's used in practice is stochastic gradient descent — I can't spell — or SGD, as it's commonly known. SGD has its origins as a computational trick, and in another field of machine learning called online learning. But in the end it turned out to work much, much better, for a reason we'll come back to — the thing I keep emphasizing, that you have access to one function but you're actually trying to optimize another. Train versus test.

So what do you do in stochastic gradient descent? Instead of taking — oh, that board's hidden, isn't it? Okay, I'm not going to do it there; I need this board, so let me erase it. With SGD you say: okay, here are all my training data points. I'm working with E_train, but I really care about something else, E_generalization. So it doesn't really matter if I get this error function exactly right, because it's just an approximation to something I can't measure anyway. Everyone see that? That's the fundamental point: I can't measure the thing I actually want. So I do something dumb. Instead of using all the training data, I make something called a mini-batch — another word you should know, just like learning rate. A mini-batch means I take the training data and grab just a small chunk of it — say 128 points. Of order 100 data points.
And actually the mini-batch size doesn't seem to depend on the amount of training data you have — it's not extensive — and I've spent way too much time thinking about why, and never came up with a good answer. All right? So what I do is divide my whole training data set, however big it is, into little batches: batch one, batch two, batch three, all the way up to (total training set size)/B batches. However many batches you get. So with B = 100 you'd have points 1 to 100, 100 to 200, 200 to 300, and so on. And instead of taking the gradient over the whole thing, I take the gradient over a mini-batch. At each step I use the same update rule as before, but with the full gradient replaced by the mini-batch gradient. And not only that — I just cycle through: at the first step I use the first mini-batch, at the second step the second mini-batch, then the next, and I go through all of them. Once you've gone through the whole data set, that's called an epoch. An epoch is one pass through the whole data set.

So what does this SGD do? First of all, you see that at each step I'm taking the gradient of a different function. And if you think about it, this is really weird, because a mini-batch is such a sparse, horrible, noisy sample. In some sense I say: here is the thing I supposedly care about, and I'm going to estimate it with this horrible, noisy thing. But actually, the noise helps you. That's one of the big lessons. Why does the noise help you, do you think? Who said that? Yeah — it's a form of regularization. Again, the point is: I don't care about the training error, I care about the generalization error. And by making the gradient noisier, I'm basically washing out the non-generic features of the training data. At first people thought it was just a computational trick, but in practice it turns out to be way better than gradient descent — even when you could afford the full computation, even on small data sets — because it's our first example of something that acts as a form of regularization. I need tricks like this, because I don't want to over-optimize E_train when what I care about is E_generalization. So SGD is the workhorse of modern machine learning. Everyone okay with that? Learn these words: epoch, learning rate, mini-batch. Everyone have those words?

So that solved the computational problem, and we got a bonus — the special bonus; there was yellow chalk, but I can't find it — the bonus is regularization: it generalizes better. One way of thinking about it is that I'm shaking up the potential: each time it's slightly different, and you're averaging out all the noise while keeping the structure. I can shake some more if it'll make people laugh. You guys are a very somber crowd. Very, very subdued.

Yeah, of course — generally it turns out you choose mini-batches of order 100, anywhere between, say, 30 and 150, but you don't want it to be extensive in the size of the data set. Why that number? No one knows; open question. It seems to depend on the application, and it depends very strongly on the learning rate. Too big is bad, because you really do want to put the noise in — that's what regularizes. That seems to be true. So it's always chosen of order 100.
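As a concrete picture of mini-batches and epochs, here's a minimal NumPy sketch on a made-up linear-regression problem. The data, the batch size B = 128, and the learning rate are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = X theta* + noise.
n, d = 1000, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def minibatch_grad(theta, Xb, yb):
    # gradient of the least-squares error on one mini-batch only
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(d)
eta, B, n_epochs = 0.05, 128, 10     # learning rate, mini-batch size, epochs

for epoch in range(n_epochs):
    perm = rng.permutation(n)        # reshuffle each epoch
    for start in range(0, n, B):     # walk through batch 1, batch 2, ...
        idx = perm[start:start + B]
        theta = theta - eta * minibatch_grad(theta, X[idx], y[idx])

print("error in theta:", np.linalg.norm(theta - theta_true))
```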
Also — you should understand that machine learning is a lot like alchemy. Or, for this audience: biology. If it works, you just keep doing it; you don't ask why. It's like talking to experimental labs: you ask why things are done the way they are, and they say because it's always done that way and it always works. So machine learning has a lot of dogma. A lot of dogma. We don't know why; we just know it works. It's not clear anything is the best solution, even though you'll see thirty papers saying it's the best solution — because four years later someone will come up with another one. So it's really empirical. It's like alchemy: we have rules of thumb, a lot of guesses, and we don't really understand very basic things, I would say. That's my take. But it works; it's a great tool.

Yeah? Sorry, can you say it louder? — Right: I use the first mini-batch for the first gradient step. The next time I take a gradient, I go to the next mini-batch, and then the next. You move on each time, so you're actually taking the gradient with respect to a different potential at each step, and you just cycle through the whole data set. That's an epoch, and you usually do multiple epochs — 10, 20. We'll see this tomorrow when we run our Python notebook.

Why not just add Gaussian noise? Yeah — a lot of physicists have spent a lot of time on that. It's because the nature of the noise is very different: SGD noise is not white noise, and it seems to work better than just adding Langevin noise, which is uncolored. In some sense the idea is that the SGD noise knows about the correlation structure of the energy landscape, whereas white noise doesn't. I think it has something to do with the next point, which is that even in your noise, you don't want to treat all directions the same way. You can see that this mini-batch noise is going to be colored, and it's going to be larger, in some sense, in the directions that are less constrained. Intuitively, the flat directions near the minima are where you get the most variation. That's basically the idea. Try to make it into a rigorous theory and it never works: I have tons of calculations that work on three numerical experiments and fail on the next three. It's very hard to make this rigorous.

All right — sorry, I have to see where I'm going next. Okay. So that's stochastic gradient descent. But in practice, especially when you're doing initial explorations, you often want an optimizer that takes care of these issues for you. I don't want it to be super sensitive to the learning rate, because I'll never find the right one; I want to know whether my network is pretty good before I start optimizing everything. So what people have done is this: they'd love to do Newton's method, up there, but it's too expensive — no big matrix inverses allowed. Still, they'd like to fix some of these problems: to have something that's not so sensitive to the hyperparameters, and that actually treats different directions differently.
So people have come up with, essentially, different fixes, and there are two basic tricks. The first, which helps with not getting stuck in overly small local minima, is to add something called momentum. The idea is that, once again, you take the mini-batch gradient, but instead of updating the parameters with the raw gradient, you give things inertia — they keep moving in the direction they were already moving. Let me see, did I write that right? Somehow that equation doesn't look right to me. Hold on — I hate writing on the board; sometimes it's so confusing. Give me one second. This is the bad part of doing everything off the top of your head; the good part is it makes a better lecture. The bad part is when you get confused. Yeah, there's something wrong with what I wrote: the momentum is not there, it's here. Let me use standard notation so I get it right:

v_t = γ v_{t−1} + η_t ∇_θ E,   θ_t = θ_{t−1} − v_t.

So instead of changing direction all at once, the update keeps going in the direction it was going, and you adjust it a little with the current gradient. γ is a parameter between 0 and 1; it's usually chosen to be 0.7 or something like that — every machine learning package has a default, and I think no one touches it. There's some best rule of thumb; who knows if it's really the best, but no one touches it. The important point is that instead of moving along the local gradient, I'm averaging the gradient over some time window — roughly 1/(1 − γ) steps — set by γ. And we know what momentum does to a ball: if I'm rolling along a bumpy hill, momentum gets me out of the tiny, shallow local minima. Everyone okay with that? Okay. So that's the first trick: SGD with momentum.
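A minimal NumPy sketch of the momentum update just written down — the 1D quadratic example and the specific hyperparameter values are mine, for illustration:

```python
import numpy as np

# SGD with momentum, as a drop-in modification of the plain update:
#   v_t     = gamma * v_{t-1} + eta * grad E(theta_{t-1})
#   theta_t = theta_{t-1} - v_t
# gamma in [0, 1) sets the window over which past gradients are remembered.

def sgd_momentum_step(theta, v, grad_fn, eta=0.05, gamma=0.7):
    g = grad_fn(theta)
    v = gamma * v + eta * g      # running, exponentially weighted gradient
    theta = theta - v            # step along the remembered direction
    return theta, v

# usage on a 1D quadratic E = theta^2 / 2 (so grad E = theta):
theta, v = 5.0, 0.0
for _ in range(50):
    theta, v = sgd_momentum_step(theta, v, grad_fn=lambda t: t)
print("theta after 50 steps:", theta)
```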
But momentum still doesn't solve the problem of treating all directions the same way. So the second trick is what are called second-moment methods: a way to implement something Hessian-like without calculating the full matrix of second derivatives. The most famous one — the default everyone uses — is called Adam. There's another common one called RMSProp, but nowadays I see Adam used much more than anything else. So what's the idea behind Adam and RMSProp?

The basic idea is that I keep track not only of the mean of the gradient — that's what momentum does — but also of its variance. The simplest version looks like this; I forgot I wouldn't have a printer here, so let me write it down, and I'll try to use the standard notation. Writing g_t = ∇_θ E^{mini-batch}(θ_t) for the mini-batch gradient,

m_t = β₁ m_{t−1} + (1 − β₁) g_t,
s_t = β₂ s_{t−1} + (1 − β₂) g_t²,

and then, roughly speaking — spiritually speaking — the update is

θ_{t+1} = θ_t − η_t m_t / (√s_t + ε).

So let's walk through these equations. They look very annoying and confusing, but they're straightforward to understand. The gradient g_t is just the stochastic mini-batch gradient we already saw. The equation for m_t is just keeping track of the gradient with momentum: my best estimate for the gradient is the current gradient averaged with what I had at the previous time step. Does everyone understand that? It's just averaging — I don't update all at once, I average over multiple time steps, on a timescale set by (1 − β₁)⁻¹. Then s_t keeps track of the gradient squared — how much things fluctuate — so it's like a variance. And then I update my parameters by the mean gradient divided by something like the standard deviation: s_t estimates the second-order fluctuations, m_t estimates the mean, and the idea is to step by the mean divided by the standard deviation, because that's the dimensionless ratio that makes sense. The standard deviation tells me what the natural unit is in each direction, and the mean tells me which way to step. Which m_t should I take seriously? The ones that are much, much bigger than the fluctuations. It's almost like moving in z-scores. And that's another trick you see in machine learning over and over — it's what batch normalization is about too: all these fancy words, and a lot of them come down to the fact that if you normalize things and work in z-scores, so everything has a natural scale, you do better.

This is called Adam. It's not quite this — there's one more clever step, a bias correction, that isn't really important for the idea; you can look in the notes if you want. But the point is that the mean and variance are calculated for each direction separately, so in each direction I move a different amount. And it also implements another form of regularization that seems to be important, which is a kind of clipping: I never want to make really big moves in any direction, because with stochastic gradients I don't trust really big moves. Dividing by the standard deviation sets a natural scale for the step and clips the big, big moves.
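Here's a stripped-down NumPy sketch of these update equations, omitting the bias correction as in the lecture. The toy quadratic, the step count, and the learning rate are illustrative:

```python
import numpy as np

# A stripped-down Adam step (without the bias-correction step the full
# algorithm has). Hyperparameter values are the common defaults.
def adam_step(theta, m, s, grad, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # running mean of the gradient
    s = beta2 * s + (1 - beta2) * grad**2         # running mean of gradient^2
    theta = theta - eta * m / (np.sqrt(s) + eps)  # step in "z-score" units,
    return theta, m, s                            # direction by direction

# usage: the anisotropic quadratic from before, E = 1/2 theta^T A theta
A = np.diag([3.0, 0.3])
theta = np.array([1.0, 1.0])
m = np.zeros(2)
s = np.zeros(2)
for _ in range(2000):
    theta, m, s = adam_step(theta, m, s, A @ theta, eta=0.01)
print("theta:", theta)   # both directions shrink despite the 10x curvature gap
```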
So that's why Adam will usually give you pretty good — not the best, but pretty good — off-the-shelf performance: it's a risk-averse algorithm. And I've gotten around the problem: I would really like to use the Hessian, but instead I just use the second moment of the gradient, something like a covariance.

So this is basically how we optimize. Yes — sorry, I can't hear you, you have to be louder. Oh, why the plus ε? Because the variance can go to zero. It just says there's a minimum variance; you want to regularize so that nothing becomes really small and blows up the ratio. Variances much smaller than ε I don't take seriously. It's just a regularizer — a minimum scale on which I'm willing to trust the variance — because especially late in optimization the variances can get very small, the numbers get tiny, and you just want to ignore that. Just a regularizer; it cuts off small things.

So that's the first ingredient of all this: the optimizer. And this is going to be really important, because it's the differentiability that matters — you see that all of these methods require us to be able to take derivatives with respect to the parameters. Notice I haven't actually said anything about the model yet: I've only told you, if I could take a derivative, how I should move in parameter space. Everyone see that? That's why the optimizer is a swappable part of every machine learning algorithm — the fun thing is, everything is modular.

Yes, Antonio? Yeah — I think what Adam allows you to do is choose a somewhat larger learning rate, but it still suffers from the basic limitations of gradient descent. Remember, the biggest problem was that you were forced down to the smallest safe learning rate; by adaptively rescaling each direction, you relax that. What's interesting is that you really need the numerator and denominator to have the same dimensions — it's really a signal-to-noise ratio — and a lot of the time when people said this kind of scheme wasn't working, it turned out they had coded that part up wrong. So the way I think about it: you're allowed to use a much bigger learning rate, and you become much less sensitive to the learning rate, because you're working with a dimensionless ratio. You're not really solving the correlation problem between directions; what you're solving is the learning-rate problem. Yeah — that's how I think about it.

All right. So the second ingredient we need is the cost function. How much time do I have, 15 minutes? This ends at 11? 12? 12. Okay, no problem. Let me clean some boards up.

There are basically two common cost functions. The cost function is, again, whatever you want — we're just going to talk about the simplest setting. At the top of almost all prediction problems is that you have two kinds of data. You have categorical data, where things are categories: cat, not cat; MNIST digits 0 through 9. Or you have continuous data. (I hope that's spelled right; I can't tell — I'm slightly dyslexic, and whenever I'm at a board I just can't see it.) For continuous data, the most common thing people do is make a prediction that's also continuous, and the most common cost is the least-squares error. All right? But you can use anything you want.
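For concreteness, a tiny sketch of the least-squares cost as a sum of per-data-point errors, together with its derivative with respect to the predictions — the numbers here are made up:

```python
import numpy as np

# Least-squares cost: one error per training point, summed up,
#   C = sum_i (y_i - yhat_i)^2,
# and its derivative with respect to the predictions,
#   dC/dyhat_i = -2 (y_i - yhat_i),
# which is what gets fed into the chain rule to reach the parameters.
def least_squares_cost(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def least_squares_grad(y, y_hat):
    return -2 * (y - y_hat)

y     = np.array([1.0, 0.5, -0.2])
y_hat = np.array([0.8, 0.7, 0.0])
print(least_squares_cost(y, y_hat), least_squares_grad(y, y_hat))
```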
The really important point is this: the cost function, in general, is some function C(y, ŷ) of the true labels y and your predictions ŷ — that's the general standard setup. And you want the cost function to be differentiable with respect to ŷ, because, by the chain rule, that's how I take the derivative with respect to θ. So any differentiable cost function works. In more complicated settings people use cleverer, more complicated things, but least squares will give you most of the intuition you need.

Categorical data is a little more fun, and it's actually very physics-inspired. The inspiration here is, of course, regression. Who knows what off-the-shelf statistical model I'd use for categorical data? Potts? That's a good guess — it is something like a Potts model, actually — but the answer is logistic regression. How many of you have heard of logistic regression? Okay, you should go look it up; it's not going to be crucial for what we do, but — okay. So the standard thing is logistic regression, and let me tell you how it works: logistic regression and, more generally, softmax regression. Logistic regression is usually for two categories; softmax handles the general case, say M categories. I was going to explain the relationship between this and maximum likelihood, but I'm hoping that was already done. (There are actually fun things here we won't get to, which make the regularizers make much more sense.)

The basic setup: imagine I have a data set D consisting of some features x_i that I measure — the pixels in MNIST, or whatever you like — and some labels y_i. Let's start with the case where y is a binary variable, 0 or 1. The basic idea of all this is that I define the probability

P(y_i = 1 | x) = 1 / (1 + e^{−θ·x})

— let me use θ for the parameters rather than w. So I take θ·x, and the probability of being in category 1 is one over one plus the exponential of minus that. Instead of predicting 0 or 1 directly, you generically make the problem what's called soft: you predict the probability of being in each category. And many of you will recognize this as a Fermi function — it's a two-level system, and θ·x is the difference in the energies of the two levels. What does this function look like? It's a sigmoid, an S-shaped curve. And of course the probability of 0 is just P(y_i = 0 | x) = 1 − P(y_i = 1 | x).

So that's the definition. And now I can do something that Spinaurny talked about a lot in the lectures: maximum likelihood estimation. I can ask, what's the probability of observing the data given my parameters?
Well, it just works out. If I call the sigmoid σ(z) = 1/(1 + e^{−z}), so that my model is σ(θ·x), then the likelihood is a product over all the data points I have:

P(D | θ) = ∏_{i=1}^{n} σ(θ·x_i)^{y_i} [1 − σ(θ·x_i)]^{1−y_i}.

Why? Because y_i is either 0 or 1, so the probability of observing a given data point is either σ or 1 − σ, and you see that one of these factors is always trivial: if y_i = 1, I only pick up the first factor, which is P(y_i = 1 | x_i); if y_i = 0, the first factor disappears and I only pick up the second. And since the data points are all independent, the probability of the data set is just the product over data points.

But I usually don't work with likelihoods; I work with log-likelihoods. Everyone okay with log-likelihoods? Taking the log of this thing, you see I get

log P(D | θ) = Σ_{i=1}^{n} [ y_i log σ(x_i·θ) + (1 − y_i) log(1 − σ(x_i·θ)) ].

This is what I maximize — but the cost function is minus the log-likelihood, because cost functions are minimized. And hopefully many of you recognize minus this expression as an entropy-like formula: for a binary variable with probability p of heads, the entropy is −p log p − (1 − p) log(1 − p). This cost is what's called the cross entropy, and it's related to the Kullback-Leibler divergence. Remember, the KL divergence is D(p‖q) = Σ_i p_i log(p_i / q_i) — I know you covered this — and the part that depends on q is the cross entropy, −Σ_i p_i log q_i; the rest is just minus the entropy of p, which doesn't compare the two distributions at all. So if q is the only part that carries the parameter dependence — and here q is the probability from my model, which I compare to the deterministic, empirical probabilities of the real data labels — then taking the derivative of the cross entropy with respect to θ is the same as taking the derivative of the KL divergence. So in some sense this cost is computing the KL divergence between the data labels and my predicted labels. This is the standard categorical thing.

I'm running out of time, but here's the basic message: this loss function comes from maximum likelihood, yet it also looks information-theoretic — and that's not a coincidence, because you can think of it as a variational free energy. If you want all that subtlety, you can go read the review. For your purposes: whenever you see cross entropy, it's maximum likelihood, it's categorical data. It'll show up tomorrow.
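A minimal NumPy sketch of the binary case just derived — the sigmoid, the cross-entropy cost, and its gradient. The toy data, the step size, and the step count are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # the "Fermi function" / sigmoid

# Binary cross-entropy = minus the log-likelihood just derived:
#   -sum_i [ y_i log sigma(x_i.theta) + (1 - y_i) log(1 - sigma(x_i.theta)) ]
def cross_entropy(theta, X, y, eps=1e-12):
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Its gradient has a famously simple form: X^T (p - y).
def cross_entropy_grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y)

# usage: a few hundred gradient-descent steps on toy labels
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
theta = np.zeros(3)
for _ in range(500):
    theta = theta - 0.01 * cross_entropy_grad(theta, X, y)
print("fitted theta:", theta, " cost:", cross_entropy(theta, X, y))
```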
And what's fun is that this generalizes in a very straightforward way to M categories, using what's basically a partition function — softmax. So take the data set again, but now the label y_i takes one of M values. What do I do with the y_i? Generally I make what are called one-hot vectors — another thing you'll see a lot in machine learning, and it's very simple: instead of representing the label as one number from 1 to M, I represent it as a vector of length M. If something is in category 1, I put a 1 in the first slot and 0 everywhere else; if it's in category 2, I put the 1 in the second slot; and so on, all the way to the M-th category. In terms of these y_{im}, everything now carries two indices: an index i that runs over data points, and an index m that runs from 1 to M, the number of categories.

And just like before, in softmax regression you write down the probability of being in category m — now with a separate set of parameters θ_m for each category, so with N_f features and M categories you have N_f × M parameters — and it's basically a Boltzmann distribution:

P(y_{im} = 1 | x_i) = e^{θ_m·x_i} / Σ_{m'} e^{θ_{m'}·x_i}.

This is the probability of being in a given category: a Boltzmann distribution, with the denominator playing the role of a partition function. And you can go through the same derivation we did right here for the binary case, and you'll get another cross entropy. Since we're running out of time, maybe I'll just show you — oh, dang it, apparently you can't stop the projector in the middle; is it not stopping? Uh oh, did I break the screen? — anyway, I'll just show you the formulas. I hate showing bunches of formulas, but it looks the same, and it's not really important for anything you'll do.

What I should say in the last five minutes is what kinds of regularizers we use. We already talked about one form of regularization, which is using SGD itself. But often what people also do is add generic regularizers, and one of the main ideas is to push as many parameters to zero as possible. The way you do that: take your normal cost function C(θ; data), which depends on your parameters θ and your data, and add a penalty for making the parameters too big. Let's just start with the most common choice, an L2 penalty: you add a term λ Σ_j θ_j², with an extra parameter λ that becomes another hyperparameter. This term says the cost goes up whenever the parameters are big, so the optimization tries to shrink the parameters toward zero. Or you can use what are called L1 penalties, λ Σ_j |θ_j|, with the absolute value instead of the square — these tend to give you sparse solutions. But in general, you put extra terms in the cost function to penalize the parameters.

The basic idea, again, is that I don't really care about the training error, I care about the test error, so I add extra terms that penalize using too many parameters. If a parameter is big, it costs a lot, so every parameter I turn on has to buy me a sufficient decrease in the rest of the cost function — and that trade-off is set by λ, which, like the learning rate, is another hyperparameter you have to choose.
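Putting the last two pieces together, here's a minimal NumPy sketch of softmax regression with an L2 penalty added to the cross-entropy cost — the one-hot encoding, the Boltzmann probabilities, and the λ‖θ‖² term. The toy data and all the hyperparameter values are illustrative:

```python
import numpy as np

# Softmax ("Boltzmann") probabilities over M categories, one theta per category.
def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # shift for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cost(Theta, X, Y, lam):
    # Theta: (n_features, M); Y: one-hot labels, shape (n, M)
    P = softmax(X @ Theta)
    cross_entropy = -np.sum(Y * np.log(P + 1e-12))
    return cross_entropy + lam * np.sum(Theta ** 2)   # + lambda ||theta||^2

def grad(Theta, X, Y, lam):
    P = softmax(X @ Theta)
    return X.T @ (P - Y) + 2 * lam * Theta   # cross-entropy grad + L2 term

# usage with made-up data, M = 3 categories
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
labels = rng.integers(0, 3, size=300)
Y = np.eye(3)[labels]                        # one-hot vectors of length M
Theta = np.zeros((4, 3))
for _ in range(300):
    Theta = Theta - 0.005 * grad(Theta, X, Y, lam=0.1)
print("cost:", cost(Theta, X, Y, lam=0.1))
```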
For some reason, in the neural network literature these L2 and L1 penalties are called weight decay — so when you read a neural network paper and people say "I use weight decay," that just means adding these penalties. And there are many more regularizers that are very neural-network specific. I'll just tell you the names of things that are meant to regularize: one is dropout, another is batch norm. These are just different tricks people have for regularizing, but the fundamental reason you need all of it is that you don't want to overfit the data. In the review there's a long list, or anywhere you look it up.

So that's basically where we are. Believe it or not, we now need only about 30 more minutes — the first 30 minutes of the next class — to explain how neural networks work and how backprop works. Because we haven't said anything about this function f yet, except that we need the cost function to be differentiable as a function of f, we want f to be differentiable as a function of θ, and we want it to be computationally easy to calculate these gradients — we have to compute the gradients with respect to all the parameters, all at once, every time we take a gradient-descent step. So we'll start there tomorrow, and we'll talk about why neural networks are really good: the thing about modern neural networks is that they're designed in such a way that it's really easy to take these gradients and calculate them quickly. Then you plug the network into this whole pipeline — the regularizer is in, you choose an optimizer — and you let it run with mini-batches and so on. We'll do that with the Python notebooks tomorrow.

So: about 30 more minutes of explaining this thing, and then you can write some deep learning models. And I promise you, if you spend another couple of weeks, you can actually understand what's going on at the state of the art. Most state-of-the-art stuff is not so hard conceptually — you get to play a lot empirically. There are clever ideas, and I don't mean to dismiss them, but the implementation matters as much as the idea, because you can't tell what's a good idea until you really play with it. There are so many things you have to tune, so much hyperparameter tuning — we'll talk about that tomorrow. That's what real machine learning is about: it's a very empirical field, it's numerical experimentation. Everyone's hungry and I'm out of time. Thanks. Before we head to the cafeteria, do you have any questions? Just one or two — and if not, you can ask me during the lunch break.