All right, so as you can see, today we don't have Yann. Yann is somewhere else, having fun. Hi, Yann, okay. So today instead we have Aaron Defazio. He's a research scientist at Facebook, working mostly on optimization. He's been there for the past three years, and before that he was a data scientist at Ambiata, and before that a student at the Australian National University. So why don't we give a round of applause to our speaker today? I will be talking about optimization and, if we have time at the end, the death of optimization. So these are the topics I will be covering today. Now, optimization is at the heart of machine learning, and some of the things I'm gonna be talking about today will be used every day in your role, potentially as an applied scientist or even as a research scientist or a data scientist. And I'm gonna focus particularly on the application of these methods rather than the theory behind them. Part of the reason for this is that we don't fully understand all of these methods. So for me to come up here and say "this is why it works," I would be oversimplifying things. But what I can tell you is how to use them, how we know that they work in certain situations, and what the best method may be for training your neural network. And to introduce you to the topic of optimization, I need to start with the worst method in the world, gradient descent. And I'll explain in a minute why it's the worst method. But to begin with, we're gonna use the most generic formulation of optimization. Now, the problems you're gonna be considering will have more structure than this, but it's very useful notationally to start this way. So we talk about a function f. Now, when we're trying to prove properties of our optimizer, we'll assume additional structure on f. But in practice, the structure in our neural networks essentially obeys none of the assumptions that people make when analyzing these methods. So I'm just gonna start with the generic f.
And we'll assume it's continuous and differentiable, even though we're already getting into the realm of incorrect assumptions, since the neural networks most people are using in practice these days are not differentiable. Instead, you have an equivalent sub-differential, which you can essentially plug into all these formulas. And if you cross your fingers (there's no theory to support this), it should work. So the method of gradient descent is shown here. It's an iterative method. So you start at a point, at step k equals zero. And at each step, you update your point, and here we're gonna use w to represent our current iterate, "iterate" being the standard nomenclature for the point. For your neural network, this w will be some large collection of weights, one weight tensor per layer. But notationally, we kind of squash the whole thing down to a single vector. And you can imagine doing that literally by reshaping all your tensors to vectors and just concatenating them together. And this method is remarkably simple. All we do is follow the direction of the negative gradient. And the rationale for this is pretty simple. So let me give you a diagram, and maybe this will help explain exactly why following the negative gradient direction is a good idea. We don't know enough about our function to do better; this is the high-level idea. When we're optimizing a function, we look at the optimization landscape locally. By optimization landscape, I mean the domain of all possible weights of our network. Now, we don't know what's gonna happen if we use any particular weights in our neural network. We don't know if it'll be better at the task we're trying to train it for, or worse. But what we do know locally is the point that we're currently at and the gradient.
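To make the update concrete, here's a minimal sketch of the gradient descent iteration in plain Python. This is not from the lecture slides; the toy function, step size, and step count are made up for illustration, and a real training loop would look quite different.

```python
# Minimal sketch of gradient descent: w <- w - gamma * grad(w).
# Assumes a fixed step size gamma; the function and values are toy choices.
def grad_descent(grad, w0, gamma=0.1, steps=100):
    """Repeatedly follow the negative gradient from the starting point w0."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - gamma * gi for wi, gi in zip(w, g)]
    return w

# Example: f(w) = sum(w_i^2) has gradient 2w and its minimizer at the origin.
w_final = grad_descent(lambda w: [2.0 * wi for wi in w], [3.0, -4.0])
```

Note that each update here is w multiplied by (1 - 2 * gamma), so the iterates shrink geometrically toward the minimizer, which is the convergence behavior discussed for quadratics below.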
And this gradient provides some information about a direction we can travel in that may improve the performance of our network, or in this case, reduce the value of our function when minimizing. In this general setup, minimizing a function is essentially training a neural network. So minimizing the loss will give you the best performance on your classification task or whatever you're trying to do. And because we only look at the world locally here, this gradient is basically the best information we have. You can think of this as descending a valley, where you start somewhere horrible, some peaky part of the landscape, the top of a mountain, for instance, and you travel down from there. And at each point you follow the direction near you that has the steepest descent. In fact, the method of gradient descent is sometimes called the method of steepest descent. This direction will change as you move through the space. Now, if you move locally by only an infinitesimal amount, assuming the smoothness that I mentioned before (which is actually not true in practice, but we'll get to that), this small step will only change the gradient a small amount. So the direction you're traveling in is at least a good direction when you take small steps. And we essentially just follow this path, taking as large steps as we can, traversing the landscape till we reach the valley at the bottom, which is the minimizer of our function. Now, there's a little bit more we can say for some problem classes. And I'm gonna use the most simplistic problem class we can, just because it's the only one I can really do any mathematics for on one slide. So bear with me. This class is quadratics. For a quadratic optimization problem, we actually know quite a bit just based off the gradient. So firstly, a gradient cuts off an entire half of the space. And I illustrate this here with this green line.
So we're at that point there where the line starts, near the green line. We know the solution cannot be in the rest of the space. This is not true for neural networks, but it's still generally a good guideline that we wanna follow the direction of the negative gradient. There could be better solutions elsewhere in the space, but finding them is much harder than just trying to find the best solution near to where we are. So that's what we do: we try to find the best solution near to where we are. You can imagine this being the surface of the earth, where there are many hills and valleys. We can't hope to know something about a mountain on the other side of the planet, but we can certainly look for the valley directly beneath the mountain where we currently are. In fact, you can think of these functions as being represented with these topographic maps. These are the same as the topographic maps you may be familiar with from the planet earth, where mountains are shown by rings. Here, though, the rings represent descent. So what's at the center there is the bottom of the valley, not the top of a hill. So yes, our gradient knocks off a whole half of the possible space. It's very reasonable then to go in the direction of the negative gradient, because it's orthogonal to this line that cuts off half the space. And you can see that I've got the indication of orthogonality there, the little square. So the properties of gradient descent depend greatly on the structure of the problem. For these quadratic problems, it's actually relatively simple to characterize what will happen. So I'm gonna give you a little bit of an overview here, and I'll spend a few minutes on this because it's quite interesting. And I'm hoping that those of you with some background in linear algebra can follow this derivation. We're gonna consider a quadratic optimization problem, the problem stated in the gray box at the top.
You can see that this is a quadratic where A is a positive definite matrix. We can potentially handle broader classes of quadratics than this, but the analysis is simplest in the positive definite case. And the gradient of that function is very simple, of course: it's Aw minus b. And the solution of this problem has a closed form in the case of quadratics: it's just the inverse of A times b. Now what we do is we take the step there, shown in the green box, and we just plug it into the distance from the solution. So this w_k minus w-star, that's the distance from the solution, and we wanna see how this changes over time. The idea is that if we're moving closer to the solution over time, the method is converging. So we start with that distance from the solution and we plug in the value of the update. Now with a little bit of rearranging, we can group the terms together, write b as A times w-star, and pull the w-star inside the brackets there. And then we get this expression: a matrix times the previous distance to the solution. Now, we don't know anything about which directions this quadratic varies most extremely in, but we can bound this very simply by taking the product of the matrix's norm and the distance to the solution, this norm at the bottom. So that's the bottom line there. Now, when you're considering matrix norms, it's pretty straightforward to see that you're going to have an expression where the eigenvalues of this matrix are going to be one minus gamma mu or one minus gamma L. The way I get this is I just look at the extreme eigenvalues of A, which we call mu and L. And by plugging these into the expression, we can see what the extreme eigenvalues will be of this combined matrix I minus gamma A. And you have this absolute value here.
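The rearrangement described above can be checked numerically. This is a sketch I've added, not the lecture's code: it uses a small diagonal positive definite A (so A acts coordinate-wise), with made-up values, and verifies that after one step of w minus gamma times (Aw minus b), the new distance to w-star equals (I minus gamma A) applied to the old distance.

```python
# Numeric check of the identity w_new - w* = (I - gamma*A)(w - w*)
# on a toy diagonal quadratic. A's diagonal entries are its eigenvalues.
A = [2.0, 5.0]                                 # diagonal of A (assumed values)
b = [1.0, 1.0]
w_star = [b[i] / A[i] for i in range(2)]       # closed form solution A^{-1} b

gamma = 0.1
w = [3.0, -2.0]
# one gradient descent step: w <- w - gamma * (A w - b)
w_new = [w[i] - gamma * (A[i] * w[i] - b[i]) for i in range(2)]

lhs = [w_new[i] - w_star[i] for i in range(2)]             # new distance
rhs = [(1.0 - gamma * A[i]) * (w[i] - w_star[i]) for i in range(2)]
```

Since A is diagonal here, each coordinate contracts independently by its own factor (1 - gamma * eigenvalue), which is exactly where the one-minus-gamma-mu and one-minus-gamma-L eigenvalues in the bound come from.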
Now, you can optimize this and get an optimal learning rate for quadratics, but that optimal learning rate is not robust in practice; you probably don't want to use it. A simple value you can use instead is one over L, L being the largest eigenvalue. And this gives you a convergence rate of a one minus mu over L reduction in distance to the solution every step. Do we have any questions here? I know it's a little bit dense. Yes? [A question about where w-star comes from.] Ah, yes, it's a substitution from that gray box. Do you see the bottom line in the gray box? Yeah, that's just by definition: we can solve for the gradient. By taking the gradient to zero, if you see that second line in the box, so replace that gradient with zero, and rearranging, you get the closed form solution to the problem. Now, the problem with using that closed form solution in practice is that we have to invert a matrix. By using gradient descent, we can solve this problem doing only matrix multiplications instead. Not that I would suggest you actually use this technique to solve such a system. As I mentioned before, it's the worst method in the world. The convergence rate of this method is controlled by this mu over L quantity. These are standard notations; we're going from linear algebra, where you talk about the min and max eigenvalue, to the notation typically used in the field of optimization: mu being the smallest eigenvalue, L being the largest eigenvalue. And this mu over L is the inverse of the condition number, the condition number being L over mu. This gives you a broad characterization of how quickly optimization methods will work on a problem. Now, these mu and L constants don't exist for neural networks. Only in the very simplest situations does L exist, and we essentially never have mu existing. Nevertheless, we want to talk about networks being poorly conditioned and well conditioned.
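Here's a small sketch (mine, not the lecture's, with assumed toy eigenvalues) of that one-minus-mu-over-L rate. On a diagonal quadratic with eigenvalues mu and L and step size one over L, the slow coordinate contracts by exactly (1 - mu/L) per step, which dominates the distance to the solution.

```python
# Rate sketch on a diagonal quadratic A = diag(mu, L) with b = 0, so w* = 0.
# The per-coordinate update is w_i <- (1 - gamma * eigenvalue_i) * w_i.
mu, L = 1.0, 10.0                   # assumed extreme eigenvalues
gamma = 1.0 / L                     # the simple, robust step size from the slide
w = [1.0, 1.0]
dists = []
for _ in range(20):
    w = [w[0] * (1.0 - gamma * mu), w[1] * (1.0 - gamma * L)]
    dists.append(max(abs(w[0]), abs(w[1])))
# The fast coordinate (eigenvalue L) is killed in one step; the slow one
# (eigenvalue mu) shrinks by exactly (1 - mu/L) = 0.9 every iteration.
```

With a condition number of 10, as here, you lose only about 10% of the remaining distance per step; a condition number of 10,000 would mean losing 0.01% per step, which is why conditioning matters so much.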
Poorly conditioned will typically mean that some approximation to L is very large, and well conditioned, maybe that L is very close to one. So the step size we can select when we're training depends very heavily on these constants. Let me give you a little bit of intuition for step sizes; this is very important in practice. I myself find a lot of my time is spent tuning learning rates, and I'm sure you'll be involved in a similar procedure. So we have a couple of situations that can occur. If we use a learning rate that's too low, we'll find that we make steady progress towards the solution. Here we're minimizing a little 1D quadratic. And by steady progress, I mean that every iteration, the gradient stays in roughly the same direction, and you make similar progress as you approach the solution. This is slower than is possible. What you would ideally want to do is go straight to the solution. For a quadratic, especially a 1D one like this, that's going to be pretty straightforward: there's going to be an exact step size that'll get you all the way to the solution. But more generally, you can't do that. And what you'll typically want to use is actually a step size a bit above that optimum. This is for a number of reasons. It tends to be quicker in practice, but you have to be very careful, because you can get divergence. The term divergence means that the iterates get further away from the solution instead of closer. This will typically happen if you use too large a learning rate. Unfortunately for us, we want to use learning rates as large as possible to get learning as quick as possible. So we're always at the edge of divergence. In fact, it's very rare that you'll see the gradients follow this nice trajectory where they all point in the same direction until you reach the solution. What almost always happens in practice, especially with gradient descent and its variants, is that you observe this zigzagging behavior.
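The edge of divergence is easy to demonstrate on the 1D quadratic mentioned above. In this sketch (my own, with assumed values), f(w) = 0.5 * L * w^2, so the update multiplies w by (1 - gamma * L) each step: any gamma above 2/L makes that factor exceed one in magnitude and the iterates blow up.

```python
# Sketch of convergence vs. divergence on f(w) = 0.5 * L * w^2.
# Each step is w <- (1 - gamma * L) * w, so |1 - gamma*L| > 1 diverges.
def run(gamma, L=10.0, w=1.0, steps=50):
    for _ in range(steps):
        w = (1.0 - gamma * L) * w
    return abs(w)

safe = run(gamma=0.15)    # 0.15 < 2/L = 0.2: factor -0.5, oscillates but converges
risky = run(gamma=0.25)   # 0.25 > 2/L = 0.2: factor -1.5, oscillates and explodes
```

Note that the "safe" run still flips sign every step (the factor is negative), which is exactly the zigzagging behavior described above: oscillation is the price of an aggressive but still-convergent step size.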
Now, we can't actually see zigzagging in the million-dimensional spaces that we train neural networks in, but it's very evident in these 2D plots of a quadratic. So here I'm showing the level sets; you can see the function value indicated there on the level sets. And when we use a learning rate that is good, not optimal but good, we get pretty close to that blue dot, the solution, after the 10 steps. When we use a learning rate that seems nicer, in that it's not oscillating, it's well-behaved, we actually end up quite a bit further away from the solution. So it's a fact of life that we have to deal with these learning rates that are stressfully high. It's kind of like a race, right? No one wins a race by driving safely. Neural network training is very comparable to that. So the core topic we want to talk about is actually stochastic optimization. And this is the method that we will be using every day for training neural networks in practice. Stochastic optimization is actually not so different. What we're gonna do is replace the gradient in our gradient descent step with a stochastic approximation to the gradient. Now in a neural network, we can be a bit more precise here. By stochastic approximation, what we mean is the gradient of the loss for a single data point, a single instance, you might wanna call it. So I've got that in the notation here. This function L is the loss for one data point, and the data point is indexed by i. We would write this typically in the optimization literature as the function f_i. And I'm gonna use this notation, but you should imagine f_i as being the loss for a single instance i. And here I'm using a supervised learning setup where we have data points x_i and labels y_i. The full loss for the function is shown at the top there: it's the sum of all these f_i's. Now, let me give you a bit more explanation for what we're doing here.
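As a concrete sketch of the single-instance update (again my own illustration, on a made-up least-squares problem, not the lecture's code): each step samples one index i uniformly and follows the gradient of that one term f_i only.

```python
import random

# SGD sketch on toy least squares: f_i(w) = 0.5 * (w * x_i - y_i)^2.
# The data satisfies y_i = 2 * x_i exactly, so every f_i is minimized at w = 2.
random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

w, gamma = 0.0, 0.02
for step in range(2000):
    i = random.randrange(len(xs))            # sample one instance uniformly
    grad_i = (w * xs[i] - ys[i]) * xs[i]     # gradient of f_i alone, not the sum
    w -= gamma * grad_i
```

This toy problem is deliberately noise-free at the solution (every per-instance gradient vanishes at w = 2), so SGD converges exactly; on real problems the per-instance gradients disagree near the optimum, which is where the noise and annealing effects discussed below come in.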
Replacing this full gradient with a stochastic gradient: this is a noisy approximation, and this is how it's often explained in the stochastic optimization setup. So we have this function, the stochastic gradient, and in our setup, its expected value is equal to the full gradient. So you can think of a stochastic gradient descent step as being a full gradient step in expectation. Now, this is not actually the best way to view it, because there's a lot more going on than that. It's not just gradient descent with noise. So let me give you a little bit more detail. But first, let me take any questions before I move on. Oh, yes. Yeah, I could talk a bit more about that. But yes, you're right. Using your entire data set to calculate a gradient is what I mean here by gradient descent. We also call that full batch gradient descent, just to be clear. Now, in machine learning, we virtually always use mini-batches. So people may use the name gradient descent or something similar when they're really talking about stochastic gradient descent. And what you mentioned is absolutely true: there are some difficulties with training neural networks using very large batch sizes. This is understood to some degree, and I'll actually explain it on the very next slide. So let me get to your point first. The point that answers your question is actually the third point here. The noise in stochastic gradient descent induces this phenomenon known as annealing, and the diagram directly to the right of it illustrates this phenomenon. So neural network training landscapes have a bumpy structure to them, where there are lots of small minima that are not good minima that appear on the path to the good minima. The theory that a lot of people subscribe to is that SGD, and particularly the noise in the gradient, actually helps the optimizer to jump over these bad minima. And the theory is that these bad minima are quite small in the space.
And so they're easy to jump over, whereas the good minima that result in good performance for your network are larger and harder to skip. So does this answer your question? Yes. So besides that annealing point of view, there are actually a few other reasons. We have a lot of redundancy in the information we get from each term's gradient, and using stochastic gradients lets us exploit this redundancy. In a lot of situations, the gradient computed on a few hundred examples is almost as good as a gradient computed on the full data set, and often thousands of times cheaper, depending on your problem. So it's hard to come up with a compelling reason to use gradient descent given the success of stochastic gradient descent. And this is part of the reason why stochastic gradient descent is one of the best methods we have, while gradient descent is one of the worst. In fact, at early stages the correlation is remarkable: the stochastic gradient can be correlated with the true gradient up to a correlation coefficient of 0.999 at those early steps of optimization. So I want to briefly talk about something you need to know about. I think Yann's already mentioned this briefly, but in practice we don't use individual instances in stochastic gradient descent; we use mini-batches of instances. I'm just using some notation here, but everybody uses different notation for mini-batching, so you shouldn't get too attached to it. Essentially, at every step you have some batch, here I'm gonna call it B, indexed by i, and you basically use the average of the gradients over this mini-batch, which is a subset of your data, rather than a single instance or the full batch. Now, almost everybody will select this mini-batch uniformly at random. Some people use sampling with replacement, but the differences are not important for this purpose; you can use either. And there are a lot of advantages to mini-batching.
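The mini-batch version of the update can be sketched as follows. This is my illustration on the same kind of made-up least-squares data as before: sample a batch uniformly without replacement, average the per-instance gradients over it, and step.

```python
import random

# Mini-batch SGD sketch: average per-instance gradients over a random subset B.
random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]                     # toy data with minimizer w = 2

def grad_i(w, i):
    """Gradient of the single-instance loss 0.5 * (w * x_i - y_i)^2."""
    return (w * xs[i] - ys[i]) * xs[i]

w, gamma, batch_size = 0.0, 0.005, 4
for step in range(3000):
    batch = random.sample(range(len(xs)), batch_size)    # uniform, no replacement
    g = sum(grad_i(w, i) for i in batch) / batch_size    # averaged batch gradient
    w -= gamma * g
```

Averaging over the batch keeps the step an unbiased estimate of the full gradient while shrinking its variance, and in a real framework the per-instance gradients in the batch would be computed in parallel on the hardware, which is where the efficiency argument below comes from.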
There are actually some compelling theoretical reasons not to mini-batch, but the practical reasons are overwhelming. Part of these practical reasons are computational: we may end up utilizing our hardware at, say, 1% efficiency when training some of the networks we use, if we try to use single instances. We get the most efficient utilization of the hardware with batch sizes often in the hundreds. If you're training on the typical ImageNet data set, for instance, you don't want to use batch sizes less than about 64 to get good efficiency. Maybe you can go down to 32. Another important application is distributed training, and this is really becoming a big thing. As was mentioned before, people were recently able to train on the ImageNet data set, which normally takes two days, and not so long ago took more than a week, in only one hour. And the way they did that was using very large mini-batches. Along with using large mini-batches, there are some tricks that you need to use to get it to work. It's probably not something that you would cover in an introductory lecture, so I encourage you to check out that paper if you're interested. It's ImageNet in one hour; I believe it's Facebook authors, I can't recall the first author at the moment. There are some situations where you need to do full batch optimization. Do not use gradient descent in that situation. I can't emphasize enough: do not use gradient descent, ever. If you have full batch data, by far the most effective method that is kind of plug-and-play, where you don't have to think about it, is known as LBFGS. It's the accumulation of 50 years of optimization research and it works really well. Torch's implementation is pretty good, and there's older code, written 15 years ago, that is pretty much bulletproof. You can use either of those. That's a good question. Classically you do need to use the full dataset. PyTorch's implementation actually supports using mini-batching.
This is somewhat of a grey area, in that there's really no theory to support the use of this, and it may work well for your problem or it may not. It could be worth trying. So when would you do full batch optimization? You'd be using the whole dataset for each gradient evaluation; or, probably more likely, since it's very rare that you want to do that, you're solving some other optimization problem that isn't training a neural network, maybe some ancillary related problem, and you need to solve an optimization problem without this structure, one that isn't a sum over data points. There was another question. The question was: Yann recommended we use mini-batches equal in size to the number of classes we have in our dataset; why is that reasonable? The answer is that we want mini-batches to be representative of the full dataset, and typically each class is quite distinct from the other classes in its properties. By choosing a mini-batch that contains on average one instance from each class (in fact, we can enforce that explicitly, although it's not necessary), we can assume it has the kind of structure of the full gradient. So you capture a lot of the correlations in the data that you see with the full gradient, and it's a good guide, especially if you're training on CPU, where you're not constrained too much by hardware efficiency; when training on a CPU, batch size is not critical for hardware utilization. It's problem dependent. I would always recommend mini-batching; I don't think it's worth trying size one as a starting point. If you're trying to eke out small gains, maybe that's worth exploring.
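The "enforce it explicitly" option mentioned above can be sketched like this: group the indices by class and draw one instance from each group, so the batch size equals the number of classes and the batch mirrors the dataset's class structure. The tiny labeled dataset here is made up purely for illustration.

```python
import random

# Class-stratified mini-batch sketch: exactly one instance per class.
random.seed(0)
# (label, index) pairs; labels and indices are assumed toy values.
dataset = [("cat", 0), ("cat", 1), ("dog", 2), ("dog", 3), ("bird", 4), ("bird", 5)]

by_class = {}
for label, idx in dataset:
    by_class.setdefault(label, []).append(idx)

# Draw one index uniformly from each class's pool:
batch = [random.choice(indices) for indices in by_class.values()]
# len(batch) == number of classes, with one representative of each class.
```

Uniform sampling only gives you this balance in expectation; stratifying like this removes the class-composition variance from the batch, at the cost of the gradients no longer being exactly the uniform-sampling estimator.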
Yes, there was another question. So in the annealing example, the question was: why is the loss landscape so wobbly? And this is actually a very realistic depiction of the actual loss landscapes of neural networks; they're incredibly wobbly, in the sense that they have a lot of hills and valleys, and this is something that is actively researched now. What we can say, for instance, is that there is a very large number of good minima, and so of hills and valleys. We know this because neural networks have this combinatorial aspect to them: you can re-parameterize a neural network by shifting all the weights around, and you can get a neural network that outputs exactly the same output for whatever task you're looking at, with all these weights moved around, and that corresponds essentially to a different location in parameter space. So given that there's an exponential number of these possible ways of rearranging the weights to get the same network, you're going to end up with a space that's incredibly spiky, with an exponential number of these spikes. Now, the reason why these local minima appear, that is something that is still active research, so I'm not sure I can give you a great answer there, but they're definitely observed in practice. What I can say is they appear to be less of a problem with very close to state-of-the-art networks. So these local minima were considered big problems 15 years ago, but not so much at the moment; people essentially never hit them in practice when using kind of recommended parameters and things like that. When you use very large batches you can run into these problems, though it's not even clear that the poor performance when using large batches is attributable to these local minima, so this is still ongoing research. Yes, the problem is you can't really see this local structure, because we're in this million-dimensional space; there's not a good way to see it. I don't know, people might have explored that already, I'm not familiar with papers on
that, but I bet someone has looked at it, so you might want to google that. Yeah, so a lot of the advances in neural network design have actually been in reducing this bumpiness in a lot of ways. This is part of the reason why it's not considered a huge problem anymore, when it was considered a big problem in the past. Any other questions? It is hard to see, but there are certain things you can do that will make the peaks and valleys smaller, certainly, and by rescaling some parts of the neural network you can amplify certain directions; the curvature in certain directions can be stretched and squashed. The particular innovation of residual connections that was mentioned: it's very easy to see that they smooth out the loss. In fact, you can kind of draw a line between two points in the space and see what happens along that line. That's really the best way we have of visualizing million-dimensional spaces, by turning them into one dimension. You can see that the loss is much nicer between these two points, whatever two points you choose, when using these residual connections. I'll be talking all about batch norm later in the lecture; hopefully I'll answer that question for you, but we'll see. Thanks. Any other questions?
Yes, so LBFGS, excellent method. It's kind of the consternation of optimization researchers that we still use SGD, a method invented in the '60s or earlier; it's still state-of-the-art. But there has been some innovation since the invention of SGD, in fact starting only a couple of years later, and one of these innovations is momentum. I'll talk about another one later. So, momentum: it's a trick that you should pretty much always be using when you're using stochastic gradient descent, and it's worth going into in a little bit of detail. You'll often be tuning the momentum parameter for your network, and it's useful to understand what it's actually doing when you're tuning it. Part of the problem with momentum is that it's very misunderstood. This can be explained by the fact that there are actually three different ways of writing momentum that look completely different, and they turn out to be equivalent. I'm only going to present two of these ways, because the third way is not as well known; it's actually, in my opinion, the correct way to view it, but I don't want to talk about my own research here. So we'll talk about how it's actually implemented in the packages you'll be using, and this first form here is what's actually implemented in PyTorch and other software that you'll be using. Here we maintain two variables. Now, you'll see lots of papers using different notation here; p is the notation used in physics for momentum, and it's very common to use that also as the momentum variable when talking about SGD with momentum, so I'll be following that convention. So instead of having a single iterate, we now have two iterates, p and w, and at every step we update both. And this is quite a simple update. The p update involves adding to the old p, and instead of adding exactly to the old p, we kind of damp the old p by a constant less than one. So we reduce the old p, and here I'm using beta hat as the constant there, so that would probably be 0.9 in practice, a small amount of damping, and we
add to that the new gradient. So p is kind of this accumulated gradient buffer, you can think of, where new gradients come in at full value and past gradients are reduced at each step by a certain factor, usually 0.9: reduced, reduced, reduced. It tends to be some sort of running sum of gradients. And basically we just modify the stochastic gradient descent step by using this p in place of the gradient in the update. So it's a two-line formula. It may be better to understand this via the second form that I've put below. You've got to map the beta through a small transformation, so it's not exactly the same beta between the two methods, but it's practically the same in practice. So these are essentially the same up to reparameterization, and this form, I think, is maybe clearer. This form is called the stochastic heavy ball method, and here our update still includes the gradient, but we're also adding on a multiplied copy of the past direction we traveled in. Now what does this mean? What are we actually doing here? It's actually not too difficult to visualize, and I'm going to use a visualization from a Distill publication; you can see the address at the bottom there. I disagree with a lot of what they talk about in that document, but I like the visualization, so let's use that, and I'll explain why I disagree in some regards later. But it's quite simple: you can think of momentum as the physical process of momentum. Those of you who have done introductory physics courses would have covered this. Momentum is the property of something to keep moving in the direction it's currently moving in. Right, if you're familiar with Newton's laws, things want to keep going in the direction they're going in; this is momentum. And when you do this mapping to physics, the gradient is kind of a force that is pushing your iterate, which by this analogy is a heavy ball; it's pushing this heavy ball at each point. So rather than making dramatic changes in the direction we
travel at every step, which is shown in that left diagram, instead of making these dramatic changes, we're going to make kind of a bit more modest changes. So when we realize we're going in the wrong direction, we kind of do a U-turn instead of putting the handbrake on and swinging around. And it turns out that in a lot of practical problems this gives you a big improvement. So here you can see you're getting much closer to the solution by the end of it; there's less oscillation. And you can see this oscillation: it's kind of a fact of life if you're using gradient descent type methods. Here we're talking about momentum on top of gradient descent in the visualization. You're going to get this oscillation; it's just a property of gradient descent, no way to get rid of it without modifying the method, and momentum to some degree dampens this oscillation. I've got another visualization here which will kind of give you an intuition. The beta parameter needs to be greater than zero; if it's equal to zero, you're just doing gradient descent. And it's got to be less than one, otherwise everything blows up, as you start including past gradients with more and more weight over time. So it's got to be between zero and one, and typical values range from a small 0.25 up to like 0.99. So in practice you can get pretty close to one. And what happens is, the smaller values result in you changing direction quicker. So in this diagram you can see, on the left, with the small beta, as soon as you get close to the solution you change direction pretty rapidly and head towards the solution. When you use these larger betas, it takes longer for you to make this dramatic turn. You can think of it as a car with a bad turning circle: it takes you quite a long time to get around that corner and head towards the solution. Now this may seem like a bad thing, but actually in practice this significantly dampens the oscillations that you get from gradient descent, and that's the nice property of it. Now in terms of practice, I can
give you some pretty clear guidance here: you pretty much always want to use momentum. It's pretty hard to find problems where it isn't beneficial to some degree. Part of the reason is just that it's an extra parameter: typically, when you take some method and add more parameters to it, you can usually find some value of that parameter that makes it slightly better. That is sometimes what's happening here, but often the improvements from using momentum are quite substantial. A momentum value of 0.9 is really the default value used in machine learning, and in some situations 0.99 may be better, so I would recommend trying both values if you have time; otherwise just try 0.9. But I have to give you a warning: the way momentum is stated in this expression, if you look at it carefully, increasing the momentum effectively increases the step size. It's not the step size on the current gradient (the current gradient is included in the step with the same strength), but past gradients become included in the step with greater strength when you increase momentum. When you write momentum in other forms this becomes a lot more obvious; this form rather occludes it. The effective step size is really your step size divided by 1 minus beta, so when you change beta you should rescale your learning rate to keep that quantity the same: map your old step size through that equation, then map it back with the new beta. This can be a very modest change, but if you're going from momentum 0.9 to momentum 0.99, you may need to reduce your learning rate by roughly a factor of 10 to keep the same effective step size. Now I want to go into a bit of detail about why momentum works; it's very misunderstood. The explanation you'll see in that Distill post is acceleration, and this is certainly a contributor to the performance of momentum. Now, acceleration is a
topic in itself. Yes, have you got a question? The question was whether there's a big difference between using momentum with gradient descent versus stochastic gradient descent, and there is. Momentum has advantages when using gradient descent as well as stochastic gradient descent: in fact, the acceleration explanation I'm about to give applies in both the stochastic and non-stochastic cases, so no matter what batch size you use, the benefits of momentum still show up. It also has additional benefits in the stochastic case, which I'll cover in a slide or two. The other part of the answer is that momentum is quite distinct from batch size, and you shouldn't conflate them: you should really be changing your learning rate when you change your batch size, rather than changing momentum. For very large batch sizes there's a clear relationship between learning rate and batch size, but for small batch sizes it's not clear; it's problem dependent. Any other questions on momentum before I move on? Yes. Yes, it's this blow-up. In the physics interpretation, conservation of momentum would correspond to beta exactly equal to one. That's not good, because if you're in a world with no friction and you drop a heavy ball somewhere, it keeps moving forever; it never stops. So we need some damping, and this is where the physics interpretation breaks down. And you can imagine that if you used a value larger than one, the past gradients would get amplified at every step; the first gradient you evaluate in your network is not relevant information, content-wise, later in optimization, but with a beta larger than one it would come to dominate the step. Does that answer your question? Any other questions about momentum before we move on? Yes: for a particular value of beta it's strictly equivalent. It's not very hard; you should be able to do the equivalence yourself in about two lines if you try. No, the betas are not quite the same, but the gamma is the same; that's why I use the same notation for it. Oh yes, so
that's what I mentioned: when you change beta, you want to rescale your learning rate, since the effective step size is the learning rate divided by one minus beta. I'm not sure it appears in this formula; it could be a mistake, but I think I'm okay here, I think it's not in this formula. You definitely need to change the learning rate along with beta to keep things balanced. What is the third formulation I mentioned before? The iterate-averaging form. It's probably not worth going over in full, but you can think of it this way: momentum is basically changing the point at which you evaluate the gradient. In the standard form you evaluate the gradient at the current w; in the iterate-averaging form you take a running average of the points you've been visiting, and you evaluate the gradient at that averaged point. Instead of averaging gradients, you average points; it's dual in a sense. Yes, so, acceleration. This is something you could spend a whole career studying, and it's somewhat poorly understood. Nesterov is kind of the grandfather of modern optimization; practically half the methods we use are named after him to some degree, which can be confusing at times. In the 1980s he came up with this formulation. He didn't write it in this form; he wrote it in another form that people realized a while later could be rewritten this way. His analysis is also very opaque, and it was originally written in Russian, which doesn't help for understanding; fortunately, all those nice people at the NSA translated the Russian literature back then, so we have access to it. It's actually a very small modification of the momentum step, but I think that small modification belies what it's actually doing; it's really not the same method at all. What I can say is that with Nesterov's form of momentum, if you very carefully choose these constants, you can get what's known as accelerated convergence. Now, this doesn't apply to neural networks, but for convex problems
(I won't go into the details of convexity, but some of you may know what it means; it's a kind of simple structure) you get a radically improved convergence rate from this acceleration, but only for very carefully chosen constants, and you really can't choose them correctly ahead of time, so you have to do quite a large search over your parameters to find the constants that give you that acceleration. What I can say is that acceleration does occur for quadratics when using regular momentum, and this has confused a lot of people: you'll see a lot of people say that momentum is an accelerated method, but it's accelerated only for quadratics, and even then it's a little bit iffy. I would not recommend using it for quadratics; use conjugate gradients or some of the newer methods developed over the last few years. Acceleration is definitely a contributing factor to why momentum works so well in practice; there's definitely some acceleration going on. But this acceleration is hard to realize when you have stochastic gradients: when you look at what makes acceleration work, noise really kills it, so it's hard to believe it's the main factor contributing to the performance, though it's certainly there. The Distill post I mentioned attributes all of momentum's performance to acceleration; I wouldn't go quite that far, but it's definitely a contributing factor. Probably the more practical and provable reason why momentum helps is noise smoothing, and this is very intuitive: momentum averages gradients, in a sense. We keep this running buffer of gradients that we use as the step instead of the individual gradients, and that's a form of averaging. It turns out that when you use SGD without momentum, to prove anything at all about it you actually have to work with the average of all the points you visited; you can get very weak bounds on the last point you ended up at, but really you have to work with this average of points. And this is suboptimal:
we never actually want to take this average in practice. It's heavily weighted toward points we visited a long time ago, which may be irrelevant, and in fact this averaging doesn't work very well in practice for neural networks; it's really only important for convex problems. Nevertheless, it's necessary to analyze regular SGD, and one of the remarkable facts about momentum is that this averaging is no longer theoretically necessary. Essentially, momentum adds smoothing during optimization that makes the last point you visit a good approximation to the solution, whereas with plain SGD you really want to average a whole bunch of the last points you've seen to get a good approximation to the solution. Let me illustrate that here. This is a very typical example of what happens when using SGD. At the beginning you make great progress: the full gradient is essentially almost the same as the stochastic gradient, so the first few steps move quickly toward the solution. But then you end up in this ball. Recall that we're heading down a valley, so this ball is essentially the floor of the valley, and you bounce around in it. The most common fix is to reduce your learning rate, so you bounce around more slowly; not exactly a great solution, but it's one way to handle it. When you use SGD with momentum, you can smooth out this bouncing around and instead kind of wheel around. Now, the path won't always be this corkscrew-style path; it's actually quite random, and you could wobble left and right, but when I seeded it with 42 this is what it spat out, and you typically do get this corkscrewing for this set of parameters. I think this is a good explanation: some combination of acceleration and noise smoothing is why momentum works. Oh yes, yes. I should say that when we inject noise here, the gradient may not even be the right direction to travel; in fact, it
could point in the opposite direction from where you want to go, and this is why you bounce around in the valley there. In fact, you can see here that the first step of SGD is practically orthogonal to the level set; that's because it's such a good step at the beginning. But once you get further down, the stochastic gradient can point in pretty much any direction vaguely around the solution. So, SGD with momentum is currently the state-of-the-art optimization method for a lot of machine learning problems, and you'll probably be using it in your course for a lot of problems. But there have been some other innovations over the years, and these are particularly useful for poorly conditioned problems. As I mentioned earlier in the lecture, some problems have this well-conditioned property that we can't really characterize for neural networks, but we can measure it by the test: if SGD works, the problem is well conditioned; if SGD doesn't work, it must be poorly conditioned. We have other methods that can handle poor conditioning in some situations, and these are generally called adaptive methods. You need to be a little bit careful here, because what exactly are you adapting to? People in the literature use this nomenclature for adapting learning rates, adapting momentum parameters, and so on, but in our situation we're going to talk about one specific type of adaptivity: individual learning rates. What I mean by that: in the formulation of stochastic gradient descent I already showed you, I used a global learning rate, by which I mean every single weight in your network is updated using an equation with the same gamma. That gamma could vary over time steps (I used gamma_k in the notation), but often you use a fixed gamma for quite a long time. For adaptive methods, we want to adapt a learning rate for every weight individually, and we want to use the information we get from gradients for each weight to do this. This seems like the obvious thing to do, and people have been trying to get it to work for
decades, and we've kind of stumbled upon some methods that work and some that don't. But I want to pause for questions here, in case any explanation is needed. I can say that it's not entirely clear why you need to do this: if your network is well conditioned, you potentially don't. But the networks we use in practice often have very different structure in different parts of the network. For instance, the early parts of a convolutional neural network have fairly shallow convolutional layers operating on large images, while later in the network you're doing convolutions with large numbers of channels on small images. These operations are very different, and there's no reason to believe that a learning rate that works well for one will work well for the other. That's why adaptive learning rates can be useful. Any questions here? Unfortunately, there's no good definition of conditioning for neural networks, and we couldn't measure it even if there were a good definition, so I'm using it in the vague sense that if SGD doesn't work, then the problem is poorly conditioned. In the quadratic case, if you recall, we have an explicit definition of the condition number, L over mu, with L being the largest eigenvalue and mu the smallest, and the larger the gap between them, the worse conditioned the problem. That doesn't carry over to neural networks; mu doesn't really exist for neural networks, although L still carries some information. But I wouldn't say it's the determining factor; there's just a lot going on. There are some ways in which neural networks behave a lot like simple problems, and other ways where we just hand-wave and say they're alike. Yeah, so for this particular network: this is a network that actually isn't too poorly conditioned already. In fact, this is VGG-16, which was practically the best network you could train before the invention of certain techniques that improve conditioning, so this is almost the best conditioning you can
actually get, and a lot of the structure of this network is actually dictated by this conditioning: we double the number of channels after certain stages because that seems to result in networks that are well conditioned, rather than for any other reason. But what you can certainly say is that weights very late in the network have a very large effect on the output. That very last layer has 4,096 units, which is a very small number compared to the millions of weights in this network, and I believe those weights have a very strong effect on the output because they directly dictate it; for that reason you generally want to use smaller learning rates for them. Weights early in the network, on the other hand, especially when you initialize your network randomly, will typically have a smaller effect. This is very hand-wavy, and the reason why is that we really don't understand this well enough for me to give you a precise statement here. 120 million, yeah, 120 million weights in this network, actually; that last layer is like a 4,096 by 4,096 matrix. Okay, any other questions? Yes. Yes, I would recommend only using L-BFGS when your problem doesn't have a structure that decomposes into a large sum of similar things. That's a bit of a mouthful, but SGD works well when you have an objective that is a sum in which each term is vaguely comparable. In machine learning, each term in this sum is the loss on one data point, and the individual losses have very similar structure, in a hand-wavy sense, because of course each data point could be quite different. When your problem doesn't have a large sum as the main part of its structure, then L-BFGS can be useful. That's the general answer. I doubt you will make use of it in this course, but it can be very handy for small networks; you can experiment around with it with
the LeNet-5 network or something, which I'm sure you'll probably use in this course; you could experiment with L-BFGS there and probably have some success. One of the founding techniques in modern neural network training is RMSProp, and I'm going to talk about it now. At some point, standard practice in the field of optimization, as in optimization research, diverged from what people were actually doing when training neural networks, and RMSProp was kind of the fracture point where we all went off in different directions. RMSProp is usually attributed to a set of slides from Geoffrey Hinton; it's unpublished, which makes it really unsatisfying to be citing someone's slides in a paper. It's a method with no proof behind why it works, but it's similar to methods that you can prove work, so that's at least something, and it works pretty well in practice, which is why a lot of people use it. I wanted to give you that introduction before explaining what it actually is. RMSProp stands for root mean square propagation; this was from the era when everything we did with neural networks was called propagation-something, like backprop, the way everything now gets called deep-something, so it would probably be called DeepRMS or something if it were invented now. It's a small modification, still a two-line algorithm, but a little bit different, and I'm going to go over these terms in some detail because it's important to understand this. We keep around this v buffer. Now, this is not a momentum buffer; we're using different notation here because v is doing something different. And I'm going to use some notation that some people really hate, but I think is convenient: I'll write the element-wise square of a vector just by squaring the vector. This is not really confusing notation in almost any situation, and it's a nice way to write it. So here I'm writing the gradient squared, meaning we take the gradient, whatever it is, and square each element
individually. This v update is what's known as an exponential moving average. I do want a quick show of hands: who's familiar with exponential moving averages? Okay, it seems like I should explain it in some depth. An exponential moving average is a standard technique, used for many decades across many fields, for maintaining an average of a quantity that may change over time. When a quantity is changing over time, we need to put larger weight on newer values, because they provide more information, and one way to do that is to down-weight old values exponentially: the weight of a value from, say, 10 steps ago will be alpha to the power 10. That's where the exponential comes in. That's not written explicitly in the notation; in the notation, at each step we just down-weight the past buffer by this constant alpha. But if you imagine the contents of the v buffer, values that are very old have been down-weighted by alpha at every step. Just as before, alpha here is between 0 and 1; we can't use values greater than 1 there. So this damps old values until they're effectively no longer part of the exponential moving average. This method keeps an exponential moving average of the second moment, and I mean the non-central second moment: we do not subtract off the mean here. The PyTorch implementation has a switch that tells it to subtract off the mean; play with that if you like, it will probably perform very similarly in practice (there's a paper on that, I'm sure), but the original method does not subtract off the mean. We use this second moment to normalize the gradient, and we do this element-wise: all this notation is element-wise, so every element of the gradient is divided through by the square root of the second-moment estimate. And you should think of the square root as
really being the standard deviation, even though this is not a central moment, so it's not actually the standard deviation; it's useful to think of it that way, and the name root mean square alludes to that division by the root of the mean of the squares. An important technical detail: you have to add an epsilon here, because of the annoying problem that when you divide zero by zero everything breaks, and you occasionally have zeros in your network. There are some situations where the epsilon makes a difference beyond the case when your gradient is zero, but you absolutely do need it in your method, and you'll see this as a recurring theme: in all these adaptive methods, you basically put an epsilon wherever you divide, just to avoid dividing by zero. Typically that epsilon will be close to your machine epsilon, if you're familiar with that term; something like 10 to the negative 7, sometimes 10 to the negative 8, so it only has a small effect on the value. Before I talk about why this method works, I want to talk about the most recent innovation on top of it, and that is the method we actually use in practice. RMSProp is sometimes still used, but more often we use a method known as Adam, which stands for adaptive moment estimation. Adam is RMSProp with momentum. I just spent 20 minutes telling you why you should use momentum, so naturally we're going to put it on top of RMSProp as well. There are a lot of ways of doing that, at least half a dozen, with papers for each of them, but Adam is the one that caught on. The way we do momentum here is to convert the momentum update into an exponential moving average as well. This may seem like a qualitatively different update, doing momentum by a moving average, but in fact what we were doing before is essentially equivalent: you can work out constants such that the exponential-moving-average form of momentum
that is equivalent to the regular momentum. So don't think of this moving-average momentum as anything different from your previous momentum; but it has the nice property that you don't need to change the learning rate when you adjust the beta here, which I think is a big improvement. So we keep a moving average of the gradient, and just as before with RMSProp, we have this exponential moving average of the squared gradient; on top of that, we basically just plug in this moving-average gradient where we had the plain gradient in the previous update. It's not too complicated. Now, if you actually read the Adam paper, you'll see a whole bunch of additional notation; the algorithm is more like 10 lines long instead of 3. That's because they add something called bias correction. This is actually not strictly necessary, but it helps a little, so everybody uses it. All it does is scale up these buffers during the early stages of optimization. The reason you do that is that you typically initialize the momentum buffer at zero. Imagine you initialize it at zero: after the first step, it contains only 1 minus beta times the gradient, which will typically be 0.1 times the gradient, because we typically use momentum 0.9. So your first gradient step is effectively using a learning rate 10 times smaller, because the momentum buffer holds only a tenth of a gradient, and that's undesirable. All bias correction does is multiply the step by the appropriate factor (10, in this case) in those early iterations; the bias-correction formula is basically the correct way to do that so the step is unbiased, where unbiased here just means the expectation of the momentum buffer equals the gradient. Nothing too mysterious; don't think of it as a huge addition, although I do think the Adam paper was the first to use bias correction in a mainstream optimization method. I don't know if they invented it, but they certainly pioneered it. So: these methods work really well
in practice. Let me just give you an empirical comparison here. Now, the quadratic I'm using is a diagonal quadratic, so it's a bit of a cheat to demonstrate a method that works well on diagonal quadratics on a diagonal quadratic, but I'm going to do it anyway. You can see the direction these methods travel is quite an improvement over SGD: on this simplified problem, SGD kind of heads off in the wrong direction at the beginning, whereas RMSProp basically heads in the right direction. Now, the problem is that RMSProp suffers from noise just as plain SGD does, so you get this situation where it bounces around the optimum quite significantly. And just as with SGD with momentum, when we add momentum to RMSProp, giving Adam, we get the same kind of improvement: you corkscrew, or sometimes reverse-corkscrew, around the solution. This gets you to the solution quicker, and it means the last point you're currently at is a good estimate of the solution, not a noisy one; it's essentially the best estimate you have. So I would generally recommend using Adam over RMSProp. And it's certainly the case that for some networks you just can't use SGD: Adam is necessary for training some of the neural networks we use, for example in our language models, and it's necessary for the networks I'll talk about near the end of this presentation. In general, if I have to recommend something, you should try either SGD with momentum or Adam as your go-to methods for optimizing neural networks. So there's some practical advice for you. Personally, I hate Adam, because I'm an optimization researcher and the convergence theory in the paper is wrong; this has been shown recently, and the method in fact does not converge, which you can show on very simple test problems. So one of the most heavily used methods in modern machine learning actually doesn't work in a lot of situations. This is unsatisfying, and the best way to fix it is an ongoing research question. I
don't think just modifying Adam a little bit to try to fix it is really the best solution; I think it has some more fundamental problems, but I won't go into detail on that. There is a very practical problem I need to talk about, though: Adam is known to sometimes give worse generalization error. I think Yann has talked about generalization error in detail, but should I go over it? Generalization error is, basically, the error on data you didn't train your model on. Our networks are very heavily over-parameterized, and if you train them to give zero loss on the data you trained on, they won't give zero loss on other data points, data they've never seen before; the generalization error is that error. Typically the best thing we can do is minimize the loss on the data we have, but sometimes that's sub-optimal, and it turns out that when you use Adam, it's quite common, particularly on image problems, to get worse generalization error than with SGD. People attribute this to a whole bunch of different things. It may be that Adam finds those bad local minima I mentioned earlier, the narrower ones; it's kind of unfortunate that the better your optimization method, the more likely it is to hit those minima, because they're closer to where you currently are, and it's the goal of these local optimization methods, in a sense, to find you the nearest minimum. But there are plenty of other explanations you could offer: less noise in Adam, perhaps, or maybe methods that rescale the space like this have some fundamental problem that gives worse generalization. We don't really understand this, but it's important to know that it may be a problem in some cases. That's not to say Adam won't give high-level performance; you'll still get a pretty good neural network at the end. And what I can tell you is that the language models we train at Facebook use Adam or Adam-like methods, and
they give much better results there than if you use SGD. And there's a small thing that I wouldn't expect to affect you at all: with Adam you have to maintain three buffers, whereas with SGD you have two buffers of parameters. This doesn't matter except when you're training a model that's something like 12 gigabytes, and then it really becomes a problem; I don't think you'll encounter that in practice. Also, tuning Adam is a little bit iffy, since you've got to tune two parameters instead of one. So yeah, that's the practical advice: use Adam or SGD. But on to something else that's also a core topic. Oh, sorry, I have a question. Yes. Yes, you're absolutely correct, but typically... oh yes, the question. The question was: won't using a small epsilon in the denominator result in a blow-up? Certainly, if the numerator were roughly equal to one, dividing by 10 to the negative 7 could be catastrophic, and this is a legitimate question. But typically, for the v buffer to have very small values, the gradient must also have had very small values; you can see that from the way the exponential moving averages are updated. So in fact it's not a practical problem: when v is incredibly small, the momentum is also very small, and when you divide a small thing by a small thing, you don't get a blow-up. Yeah, so the question is: should I run SGD and Adam separately at the same time and just see which one works better? In fact, that is pretty much what we do, because we have lots of computers: we just have one computer run SGD and one run Adam and see which works better, although we roughly know for most problems which one is the better choice. For whatever problems you're working on, maybe you can try both; it depends how long it's going to take to train. I'm not sure exactly what you'll be doing, but there's a lot of practical work in this course, so it's certainly a legitimate way to do it. In fact, some people use SGD at the beginning and then switch to Adam at the end; that's certainly a good
approach; it just makes things more complicated, and complexity should be avoided if possible. Yes? This is one of those deep unanswered questions. The question was: should we run SGD with lots of different initializations and pick the one that gives the best solution, and won't that help with the bumpiness? That is the case with small neural networks: you get different solutions depending on your initialization. But there's a remarkable property of the kind of large, state-of-the-art networks we use at the moment: as long as you use a similar random initialization, in terms of the variance of your initialization, you'll end up with solutions of practically similar quality, and this is not well understood. It's quite remarkable that your neural network can train for 300 epochs and end up with a test error almost exactly the same as what you got from a completely different initialization; we don't understand this. So if you really need to eke out tiny performance gains, you may be able to get a slightly better network by running several and picking the best, but it seems the bigger your network and the harder your problem, the less you gain from doing that. Yes: the question was whether we have three buffers for each weight, and the answer is yes. Essentially, in memory we have our weights, which are a whole collection of tensors; we have a separate collection of tensors that are the momentum tensors; and we have another collection that are the second-moment tensors. So, normalization layers. This is kind of a clever idea: why try to come up with a better optimization algorithm when we can just come up with a better network? That's the idea. In modern neural networks, we typically modify the network by adding additional layers in between the existing layers, and the goal of these layers is to improve the optimization and generalization performance of the network. The way they do this can happen in a few different ways, but
let me give you an example. As you know, in modern neural networks we typically alternate linear operations with nonlinear operations, which here I call activation functions: linear, nonlinear, linear, and so on. What we can do is place these normalization layers between the linear and nonlinear operations, or before the linear ones. In this case, for instance, we have the kind of structure found in real networks: a convolution (recall that convolutions are linear operations), followed by batch normalization, which is a type of normalization I will detail in a minute, followed by ReLU, currently the most popular activation function. So we place the normalization between these existing layers. What I want to make clear is that these normalization layers affect the flow of data through the network, so they modify the data flowing through, but they don't change the power of the network: you can always set up the weights so that a normalized network still gives whatever output you had in the unnormalized network. So normalization layers do not make your network more powerful; they improve it in other ways. That's unusual, because normally when we add things to a neural network, the goal is to make it more powerful. And yes, the normalization layer can also go after the activation, or before the linear operation; because the pattern wraps around, these placements are related. Any questions here?
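To make that ordering concrete, here is a minimal NumPy sketch of a linear operation, followed by a normalization step, followed by a ReLU. This is a toy fully-connected version with made-up sizes, not the convolutional batch-norm layer from any real library, and it simply uses batch statistics as the mean and standard deviation estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x, eps=1e-5):
    # Whiten each feature (column) using statistics estimated over the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

# Linear -> normalization -> nonlinearity, the ordering described above.
x = rng.normal(size=(32, 8))   # a batch of 32 inputs with 8 features
W = rng.normal(size=(8, 16))   # weights of the linear layer
h = relu(normalize(x @ W))     # pre-activations are whitened before the ReLU
```

In a real framework the normalization step would also carry the learnable scale and shift parameters, so the layer's outputs are not forced to stay whitened.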
Yes, so that is certainly true, but we kind of want that: we want the ReLU to censor some of the data, but not too much. It's also not quite accurate, because normalization layers can also scale and shift the data, so it won't necessarily cut off exactly half, although certainly at initialization, before that scaling and shifting is learned, they will typically cut off half the data. And in fact, if you are trying to do a theoretical analysis of this, it's very convenient that it cuts off half the data. So, the structure of these normalization layers: they all pretty much do the same kind of operation, and I'm going to use generic notation here. You should imagine that X is an input to the normalization layer and Y is an output. What you do is a whitening or normalization operation, where you subtract off some estimate of the mean of the data and divide through by some estimate of the standard deviation. And remember before, I mentioned we want to keep the representational power of the network the same. What we do to ensure that is we multiply by an A and we add a B, and this is just so that the layer can still output values over any particular range. If every layer always output whitened data, the network couldn't output a value like a million or something like that; it could only do that in very rare cases, because that would be very heavy in the tail of the normal distribution. So this allows our layers to essentially output things that have the same range as before. And yes, normalization layers have parameters, so the network is a little bit more complicated in the sense that it has more parameters, but it's typically a very small number of parameters, like a rounding error in your count of network parameters. I'm being kind of vague about how you compute the mean and standard deviation; the reason is that all the methods compute it in a different way, and I'll detail that in a second.
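The generic operation just described can be written down directly; here is a numpy sketch. The names `a`, `b`, and `eps` are my own, and how the mean and standard deviation are estimated is deliberately left naive, since that is exactly the part that varies between the methods detailed next.

```python
import numpy as np

def normalize(x, a, b, eps=1e-5):
    # Whiten the input, then let the learned scale `a` and shift `b`
    # restore the layer's ability to output values over any range.
    mean = x.mean()
    std = x.std()
    return a * (x - mean) / (std + eps) + b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = normalize(x, a=1.0, b=0.0)   # whitened: mean ~0, std ~1
z = normalize(x, a=10.0, b=5.0)  # same data, rescaled to a new range
```

Without `a` and `b`, every layer's output would be pinned to mean 0 and standard deviation 1; with them, the layer keeps the same representational power as before.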
Yes, question. It's just a shift parameter. The data could have had a non-zero mean, and we want the layer to be able to produce outputs with a non-zero mean; if we always just subtracted off the mean, it couldn't do that. So that's the representational power of the layer. Yes, so the question is: don't these A and B parameters reverse the normalization? In fact, that often is the case, they do something similar, but they move at different timescales. Between steps, or between evaluations of your network, the mean and variance can shift quite substantially based on the data you feed in, but these A and B parameters are quite stable; they move slowly as you learn them. Since they're more stable, this has beneficial properties, and I'll describe those a little bit later. But what I want to talk about is exactly how you normalize the data, and this is really the crucial thing. The earliest of these methods developed was batch norm, and it uses kind of a bizarre normalization that I think is a horrible idea, but unfortunately it works fantastically well. It normalizes across the batch. So we want information about a certain channel; recall that for a convolutional neural network, a channel is one of these latent images you have partway through the network. You have some data that doesn't really look like an image if you actually look at it, but it's shaped like an image; that's a channel. We want to compute an average over this channel, but we only have a small amount of data for what's in this channel, basically height times width if it's an image, and it turns out that's not enough data to get good estimates. So what batch norm does is take mean and variance estimates across all the instances in your mini-batch, straightforward enough, and that's what it normalizes by. The reason I don't like this is that it's no longer actually stochastic gradient descent if you're using batch normalization, so it breaks all the theory that I work on for a living. So I prefer some
other normalization strategies. In fact, quite soon after batch norm, people tried normalizing via every other possible combination of things you can normalize over, and it turns out the three that work are layer, instance, and group norm. With layer norm, here in this diagram, you average across all the channels and across height and width. Now, this doesn't work on all problems, so I would only recommend it on a problem where you know it already works, and that's typically a problem where people are already using it; so look at the networks people are using, whether it's a good idea or not will depend on the problem. Layer normalization is something that's used a lot in modern language models, and you do not average across the batch anymore, which is nice. I won't talk about it in much depth. Really, the one I would rather you use in practice is group normalization. Here we average across a group of channels, and this group is chosen arbitrarily and fixed at the beginning. Typically we just group things numerically, so channels 0 to 9 would be one group, channels 10 to 19 the next, and so on, disjoint groups of channels, making sure you don't overlap, of course. The size of these groups is a parameter you could tune, although in practice we almost always use 32. You do this because there's not enough information in a single channel, and using all the channels is too much, so you use something in between; it's really quite a simple idea. And it turns out this group norm often works better than batch norm on a lot of problems, and it does mean that the SGD theory I work on is still valid, so I like that. So why does normalization help?
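Before getting to why it helps, for concreteness: the four variants just described differ only in which axes the mean and variance are estimated over. A numpy sketch, with activations shaped (batch, channels, height, width); the sizes and the group size of 8 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 4, 4))  # (batch N, channels C, H, W)

# Batch norm: one estimate per channel, pooled across the mini-batch.
bn_mean = x.mean(axis=(0, 2, 3), keepdims=True)   # -> (1, 32, 1, 1)

# Layer norm: one estimate per instance, across all channels.
ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)   # -> (8, 1, 1, 1)

# Instance norm: one estimate per instance, per channel.
in_mean = x.mean(axis=(2, 3), keepdims=True)      # -> (8, 32, 1, 1)

# Group norm: split channels into disjoint groups and estimate
# within each group, still separately per instance.
g = 8                                             # group size
xg = x.reshape(8, 32 // g, g, 4, 4)
gn_mean = xg.mean(axis=(2, 3, 4), keepdims=True)
gn_std = xg.std(axis=(2, 3, 4), keepdims=True)
gn = ((xg - gn_mean) / (gn_std + 1e-5)).reshape(x.shape)
```

Note that only batch norm pools across instances; the other three are computed per instance, which is why they distribute across machines without any synchronization, a point that comes up again below.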
This is a matter of dispute. In fact, in the last few years several papers have come out on this topic; unfortunately, the papers did not agree on why it works, they all have completely separate explanations. But there are some things that are definitely going on. We can say for sure that the network appears to be easier to optimize: by that I mean you can use larger learning rates. In a better-conditioned network you can use larger learning rates and therefore get faster convergence, and that does seem to be the case when you use these normalization layers. Another factor, which is a little bit disputed but I think is reasonably well established: you get noise in the data passing through your network when you use normalization. In batch norm, this noise comes from the other instances in the batch. Because it's random which other instances are in your batch, when you compute the mean using those other instances, that mean is noisy, and this noise is then added, or rather subtracted, from your activations when you do the normalization operation. So this noise is actually potentially helping generalization performance in your network. Now, there have been a lot of papers on injecting noise into networks to help generalization, so it's not such a crazy idea that this noise can be helping. In terms of a practical consideration, normalization makes the weight initialization you use a lot less important. It used to be kind of a black art to select the initialization in your network, and the people who had really good models, often it was just because they were really good at tuning their initialization; that's just less the case now that we use normalization layers. It also gives the benefit that you can tile together layers with impunity. Again, it used to be the situation that if you just plugged together two arbitrary layers in your network, it probably wouldn't work; now that we use normalization layers, it probably will work, even if it's a horrible idea. And this has spurred a whole field
of automated architecture search, where they just randomly combine blocks together, try thousands of them, and see what works. That really wasn't possible before, because it would typically result in a poorly conditioned network you couldn't train; with normalization, typically you can train it. Some practical considerations. In the batch norm paper, one of the reasons why it wasn't invented earlier is the kind of non-obvious fact that you have to backpropagate through the calculation of the mean and standard deviation; if you don't do this, everything blows up. Now, you won't have to do this yourself, as it will be handled in the implementation that you use. I do not have the expertise to answer that; I feel like sometimes it's just a pet method, people in that field like layer and instance norm more, and in fact group norm, with the group size chosen appropriately, covers both, so I would expect you could probably get the same performance using group norm with a particular group size chosen carefully. The choice of batch norm does affect parallelization. The implementations in your CUDA library or your CPU library are pretty efficient for each of these, but it's complicated when you're spreading your computation across machines and you have to synchronize these things. Batch norm is a bit of a pain there, because it would mean you need to compute an average across all machines and aggregate it, whereas if you're using group norm and every instance is on a different machine, you can just compute the norm separately. In all the other three, it's a separate normalization for each instance; it doesn't depend on the other instances in the batch. It turns out that when people use batch norm on a cluster, they actually do not sync the statistics across machines, which makes it even less like SGD and makes me even more annoyed. Batch norm basically has a lot of momentum, not in the optimization sense but in the sense of people's minds, so it's very
heavily used for that reason, but I would recommend group norm instead. And there's a technical detail with batch norm: you don't want to compute these means and standard deviations on batches during evaluation time. By evaluation time, I mean when you actually run your network on the test data set or use it in the real world for some application; typically, in those situations you don't have batches anymore, batches are more of a training thing. So you need some substitution in that case: you can compute an exponential moving average, as we talked about before, an EMA of these means and standard deviations. You may think to yourself, why don't we use an EMA during training in the implementation of batch norm? The answer is because it doesn't work. It seems like a very reasonable idea, though, and people have explored it in quite a lot of depth, but it doesn't work. Oh yes, this is quite crucial: people tried normalizing things in neural networks before batch norm was invented, but they always made the mistake of not backpropagating through the mean and standard deviation, and the reason they didn't do that is because the math is really tricky, and if you try and implement it yourself it will probably be wrong. Now that we have PyTorch, which computes gradients correctly for you in all situations, you can actually do this in practice. I've tried it myself a little bit, but only a little bit, because it's surprisingly difficult. So the question is: is there a difference if we apply normalization before or after the nonlinearity? The answer is there will be a small difference in the performance of your network. Now, I can't tell you which one is better, because it appears that in some situations one works a little bit better, and in other situations the other works better. What I can tell you is that the way I drew it here is what's used in the PyTorch implementation of ResNet, and most ResNet implementations, so it's probably almost as good as you can get; I think they would use the other form if it was better. And
it's certainly problem dependent. This is another one of those things where maybe there's no correct answer for how you should do it, and it's just random which works better; I don't know. Any other questions on this before I move on? You need more data to get accurate estimates of the mean and standard deviation: the question was why it's a good idea to compute it across multiple channels rather than a single channel, and yes, it is because you just have more data to make better estimates. But you want to be careful: you don't want too much data in that estimate, because then you don't get the noise, and recall that the noise is actually useful. The group size in group norm is basically just adjusting the amount of noise we have. The question was how this is related to group convolutions. This was all pioneered before group convolutions were used; it certainly has some interaction with group convolutions if you use them, so you want to be a little bit careful there. I don't know exactly what the correct thing to do is in those cases, but I can tell you they definitely use normalization in those situations, probably batch norm more than group norm, because of the momentum I mentioned; batch norm is probably more popular. Yes, so the question is: do we ever use other instances from the mini-batch in group norm, or is it always just a single instance? We always just use a single instance, because there are so many benefits to that; it's so much simpler in implementation and in theory. Maybe you could get some improvement from using the batch; in fact, I bet there's a paper that does that somewhere, because they've tried every combination of this in practice, and I suspect if it worked well, we'd probably be using it. The death of optimization. I wanted to put in something a little bit interesting, because you've all been sitting through a pretty dense lecture, so this is something I've been working on a little bit that I thought you might find interesting. So you might have seen
the XKCD comic here, which I've modified. It's not always this way; that's kind of the point I want to make. Sometimes we can just barge into a field we know nothing about and improve on how they're currently doing things, although you have to be a little bit careful. The problem I want to talk about is one that Jan, I think, mentioned briefly in the first lecture, but I want to go into a bit of detail: MRI reconstruction. In the MRI reconstruction problem, we take raw data from an MRI machine, a medical imaging machine, and we reconstruct an image. There's some pipeline, an algorithm in the middle there, that produces the image, and the goal here is basically to replace 30 years of research into what algorithm they should use with neural networks, because that's what I get paid to do, sorry. I'll give you a bit of detail. These MRI machines capture data in what's known as the Fourier domain. I know a lot of you have done signal processing; some of you may have no idea what this is, and you don't need to understand it deeply for this problem. You may have seen the Fourier domain in the one-dimensional case; for MRI reconstruction, we have a two-dimensional Fourier domain. The thing you need to know is that it's a linear mapping to get from the Fourier domain to the image domain, it's just linear, and it's very efficient to do that mapping: it literally takes milliseconds on modern computers, no matter how big your image is, so it's easy to convert back and forth between the two. The MRI machines actually capture either rows or columns of this Fourier domain as samples, they call each one a sample in the literature. So each time the machine computes a sample, which is every few milliseconds, it gets a row or column of this image. This is actually technically a complex-valued image, but that does not matter for my discussion; you can imagine it's just a two-channel image, a real and an imaginary
channel; just think of them as color channels. The problem we want to solve is accelerating MRI. Acceleration here is in the sense of faster: we want to run the machines quicker and produce identical-quality images. One way we can do that, and the most successful way so far, is by just not capturing all of the columns; we skip some randomly. It's useful in practice to also capture some of the middle columns, it turns out they contain a lot of the information, but outside the middle we just capture randomly. And then we can't use our nice linear operation anymore: that diagram on the right is the output of the linear operation I mentioned applied to this data, and it doesn't give useful output, so we've got to do something a little bit more intelligent. Any questions on this before I move on? It is frequency and phase dimensions. In this diagram, one dimension is frequency and one is phase, and the value is the magnitude of a sine wave with that frequency and phase. If you add together all the sine waves, weighted by the values in this image, you get the original image. It's a little bit more complicated because it's in two dimensions, and with the sine waves you've got to be a little bit careful, but basically each pixel is the magnitude of a sine wave. In the 1D analogy you just have frequencies, so the pixel intensity is the strength of that frequency: if you have a musical note, say a piano playing a C as one of the frequencies, that would be one pixel, this pixel would be the C frequency, and another might be an A or something like that, and the magnitude is just how hard they pressed the key on the piano. So yeah, frequency information. So the linear mapping doesn't work. One of the biggest breakthroughs in theoretical mathematics in a long time was the invention of compressed sensing; I'm sure some of you have heard of compressed sensing, show of hands,
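The measurement setup just described can be sketched in numpy: the Fourier-to-image map is a single fast linear operation, and zeroing the skipped columns before inverting produces the useless aliased image mentioned above. The image size, centre band, and sampling fraction here are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))       # stand-in for an MRI slice

kspace = np.fft.fft2(image)                 # the linear map: milliseconds-fast

# Keep the centre columns (they carry a lot of the information) plus
# a random subset of the rest; the machine captures whole columns.
mask = np.zeros(64, dtype=bool)
mask[28:36] = True
mask |= rng.random(64) < 0.25

undersampled = kspace * mask[None, :]       # skipped columns become zeros
aliased = np.fft.ifft2(undersampled).real   # naive linear reconstruction

full = np.fft.ifft2(kspace).real            # with all columns, the inverse
                                            # recovers the image exactly
```

With all the data, `full` matches `image` to numerical precision; with columns zeroed out, `aliased` no longer does, which is the gap that compressed sensing, and later neural networks, try to close.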
compressed sensing? Some of you, especially those working in the mathematical sciences, would be aware of it. Basically, there was this phenomenal theoretical paper that showed we could actually, in theory, get a perfect reconstruction from these sub-sampled measurements, and there were some requirements for this to work. The requirements were that we need to sample randomly; in fact, it's a little bit weaker, you have to sample incoherently, but in practice everybody samples randomly, so it's essentially the same thing. Now, here we're randomly sampling columns, but within the columns we do not randomly sample, the reason being that it's not faster in the machine: the machine can capture one column as quickly as it could capture half a column, so we just capture a whole column, no longer randomly within it. That's one kind of problem. The other problem is that the assumptions of this compressed sensing theory are violated by the kind of images we want to reconstruct. I'll show you on the right there an example of a compressed sensing reconstruction. This was a big step forward from what they could do before; previously you would get something that looked like this, and that was really considered the best. In fact, some people, when this result came out, swore that it was impossible, but it's actually not. But you need some assumptions, and these assumptions are pretty critical; I mention them there. You need sparsity of the image. Now, that MRI image is not sparse; by sparse I mean it has a lot of zero or black pixels, and it's clearly not. But it can be represented sparsely, or approximately sparsely, if you do a wavelet decomposition; I won't go into the details. There's a little bit of a problem, though: it's only approximately sparse when you do that wavelet decomposition. If it were exactly sparse in the wavelet domain, the reconstruction would be exactly the same as the left image. And this compressed sensing is based on the field of optimization; it kind of revitalized a lot of the techniques
people had been using for a long time. The way you get this reconstruction is you solve a little mini optimization problem for every image you want to reconstruct that comes out of the machine; your machine has to solve an optimization problem for every image, every time, and this little problem has that kind of complicated regularization term. So this was great for optimization people: all these people who had been in low-paid jobs at universities, all of a sudden their research was trendy and corporations needed their help. So this is great, but we can do better. Instead of solving this minimization problem every time, we use a big neural network. I'm using B here arbitrarily to represent a huge neural network, and we hope that we can learn a network of sufficient complexity that it can essentially solve the optimization problem in one step: it just outputs a solution that's as good as the optimization problem's solution. Now, this would have been considered impossible 15 years ago; now we know better, and it's actually not very difficult. In fact, we can just solve a few of these optimization problems, and by a few I mean a few hundred thousand, take the solutions and the inputs, and train a neural network to map from input to solution. That's actually a little bit suboptimal, because in some cases we know a better solution than the solution to the optimization problem: we can gather it by measuring the patient, and that's what we have to do in practice. So we don't try and match the optimization problem; we try to get to an even better solution, and this works really well. I'll give you a very simple example: this is what you can do, much better than the compressed sensing reconstruction, using a neural network. This network involves the tricks I've mentioned: it's trained using Adam, it uses group norm normalization layers, and convolutional neural networks, as you've already been taught, and
it uses a technique known as U-Nets, which you may go over later in the course, I'm not sure about that, but it's not a very complicated modification of convolutional networks. So yeah, this is the kind of thing you can do, and this is very close to practical application: you'll be seeing these accelerated MRI scans happening in clinical practice in only a few years' time; this is not vaporware. And yeah, that's everything I wanted to talk about today: optimization and the death of optimization. Thank you.