All right, so hello, everyone. Hello, internet people. So, last time we discussed the out-of-the-box algorithm used to optimize cost functions. You remember, the point is that in linear regression, at least in ordinary least squares and ridge regression, we have the chance to be able to write down the solution of the optimization problem we want to solve explicitly: linear algebra gives you the answer directly. But in general you cannot do that, not even if the cost function you are trying to optimize is convex; you cannot always write down the solution by hand. Actually, the cases we have seen are essentially the only cases where you can. So we need a way to find minima of these complicated cost functions, and by finding minima I mean really solving this optimization problem, where theta is in general a high-dimensional vector. And what we said is that the simplest, most natural, and yet efficient way is gradient descent. You start from an initial guess for your parameters, and then you loop. At each step in this loop you compute what we call the velocity, which is proportional to the gradient of the cost function: the gradient at the position you occupy at time t, and by position I mean the value $\theta_t$ of the parameters at time t. You compute the gradient of the cost function and evaluate it at this position, because when I say position, I am thinking of a particle moving in a complicated landscape. This particle is described by the parameters $\theta_t$; it is a moving particle. So you compute the gradient at its position and you multiply it by the learning rate. And I said that this learning rate is a very important quantity to choose carefully. If it is too low, when you update the new values of the parameters, which are given by the old values minus this velocity, you will follow the gradient but very slowly, so the computational cost is high: the algorithm will be extremely slow. But if the learning rate is small enough, you are guaranteed to really follow the gradient and converge to at least a local minimum, which of course is not guaranteed to be the global one. Now, if you start to increase the learning rate, there is a sweet spot where you converge as fast as possible to the local minimum, actually in one step if you choose it exactly right, at least in the simple case where the function is quadratic. Let us say there is an optimal value of the learning rate which lets you converge fast without overshooting, because if the learning rate is too high, you will update your parameters by an amount whose amplitude is too large, you will go past the actual minimum, and you will start to oscillate if not diverge. So the learning rate is a very important quantity. Then we said that this gradient descent algorithm has a number of issues. One is that it is a deterministic algorithm, deterministic in the sense that once you have chosen the initial condition, the point from which you start the descent, there is no stochasticity: you will always converge to the same local minimum.
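(As a minimal sketch of this recap in code — my own illustration, not the course notebook — one step of the loop is "velocity = learning rate times gradient, new parameters = old parameters minus velocity":)

```python
import numpy as np

def gradient_descent(grad_cost, theta0, eta=0.1, n_steps=1000):
    """Plain gradient descent: repeatedly step against the gradient of the cost."""
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        v = eta * grad_cost(theta)  # the "velocity": learning rate times gradient at theta_t
        theta = theta - v           # update: new value = old value minus the velocity
    return theta

# Example: C(theta) = ||theta||^2 / 2 has gradient theta itself, so we should reach 0.
print(gradient_descent(lambda th: th, theta0=[1.0, -2.0]))
```

(With eta too small this loop needs many iterations; with eta too large, theta overshoots and oscillates or diverges, exactly as described above.)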
So this algorithm is very sensitive to the initial condition. Furthermore, it is actually very costly to compute this kind of gradient, because this cost function, again, is a sum over all samples in the training data, and typically you have maybe thousands or millions of training samples; for each sample in this sum you have to compute p partial derivatives. So the complexity scales as p times n times the number of iterations of the algorithm, and p times n is a huge number in the high-dimensional applications we have in mind. This is not very practical, and at the same time it suffers from the weakness that if there happens to be a small local minimum right around your starting point, you will just get stuck there. To face all these problems at once, we introduced stochastic gradient descent. The idea of stochastic gradient descent is not so original or fancy: you just do gradient descent, but each time you want to compute a gradient, you compute it not on the whole data but on a mini-batch, a subset of samples of the data, typically a very small subset compared to the size of the data. And because these mini-batches are chosen randomly, this introduces a notion of noise, of stochasticity. So from one run to the other, even from the same initial condition, SGD will give you two different solutions once it has converged. The only difference, then, is that you compute the gradient on a mini-batch, which is a noisy approximation of the true gradient. And when I say true gradient, we have to be careful here: what does "true" mean? Let me divide by n to create an average, an empirical average. You see that if we had access to infinitely many samples, this empirical average over samples would become the statistical average: the expectation with respect to the distribution of the data. So "true gradient" would actually mean finding the minimum of this expected cost, where by expected I really mean the expectation with respect to the data distribution. This expectation of the cost is also sometimes called the risk; again, as I told you, when we take an expectation of some quantity with respect to the data, we often call it a risk. It could be any error, and here the cost is an error: you are trying to minimize an error between the predicted labels and the ones in the data. The full expectation of the cost is the risk. So what we would like to do in reality is minimize this risk. Of course we do not have access to infinitely many data, so there is a first level of approximation: we approximate the risk by this empirical average, a finite-size approximation of the risk, and this is what we call the cost. This is the first level of approximation. So even if we really computed the gradient of this sum over all samples in the training data, it would still be an approximation, because we have access to finitely many data, which leads to the variance in the bias-variance tradeoff and so on and so forth.
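(To summarize the two levels of approximation in symbols — my notation, writing $\ell(y_i, f(x_i;\theta))$ for the error on sample $i$; the second approximation is spelled out next:)

\[
\underbrace{\mathbb{E}\big[\ell(y, f(x;\theta))\big]}_{\text{risk: expectation over the data distribution}}
\;\approx\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\big(y_i, f(x_i;\theta)\big)}_{\text{cost: empirical average over $n$ samples}}
\;\approx\;
\underbrace{\frac{1}{|B|}\sum_{i\in B}\ell\big(y_i, f(x_i;\theta)\big)}_{\text{mini-batch estimate used in SGD}}
\]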
All right, but then there is a second level of approximation in SGD, which is that we are not even computing the gradient with respect to the whole data; we are computing it with respect to a subset of the data. You see the two levels of approximation, and when I say this is a noisy approximation of the true gradient, you can interpret "true" either as the gradient of the risk or as the gradient with respect to the full data: it is an approximation of both quantities, where one is already an approximation of the other. And a way to imagine what is going on in SGD is this: now that the cost itself is a random, stochastic quantity, due to the stochastic nature of the mini-batch, the landscape in which we are moving is itself random. Each time you compute a gradient, you are computing it in a different landscape, and hopefully these fluctuations of the landscape will allow the dynamics to converge to a minimum that makes sense, one that generalizes well. All right, I would like to continue from there, and, just so that you see where we are in the notes, I want to discuss momentum. So, what does momentum mean? It is yet another version of gradient descent; let me write "stochastic gradient descent with momentum". And this algorithm is actually what people really use, even in very advanced applications, and it is actually the most efficient, in the sense that it is the algorithm that empirically seems to lead to the best minima, the minima with good prediction performance. I put the "stochastic" in parentheses because this idea of momentum, which we will discuss now, can be applied to plain gradient descent or used within stochastic gradient descent. If you keep the "stochastic", it means you are considering mini-batches; if you drop it, you are considering the full gradient with respect to the whole data. Momentum applies in both cases. So the algorithm looks like this: again you have a for loop over time, and here I will write the dynamics as if it were plain gradient descent, but you can do exactly the same for SGD by introducing mini-batches. You have, of course, an initial condition; you compute the velocity, and then you have the update, which is as before. Again, if you want to make this stochastic, which usually you do, you just replace the gradient by the gradient of the cost evaluated on a mini-batch; that is the only difference, and it makes the algorithm stochastic. So what is the difference from before? Only this term here: the velocity is no longer just proportional to the gradient, there is also a term proportional to the velocity at the previous time step. As before, eta is the learning rate, while this gamma is called the momentum.
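(In formulas, the update rule just described reads, in the notation of the notes, with the convention that the velocity is subtracted:)

\[
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta C(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t ,
\]

(where $\eta$ is the learning rate and $\gamma$ the momentum; setting $\gamma = 0$ recovers plain (stochastic) gradient descent.)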
So why is it called the momentum? This is what we will try to understand. The vocabulary comes from physics, of course: the momentum of a moving object is tied to its resistance to changing its velocity, its direction — it is related to inertia. And you need mass to have inertia. To say it properly: inertia is the resistance to changing your state of motion, and the momentum is essentially your velocity times your mass. All right, so let us try to understand the analogy with physics, because a natural way, at least for physicists, to visualize this gradient descent dynamics is really as a particle moving in a landscape; now we will give a mass to this particle, and therefore it will have inertia. So the way to understand the link with physics is to first realize that this dynamics can be written in an equivalent, different way, like this; the two are completely equivalent. You will see it in the notes, by the way — and I hope you do try to read the notes as well; what I do here is essentially make them a bit more alive, but I am really following them closely. Here $\Delta\theta_{t+1}$ is by definition $\theta_{t+1} - \theta_t$. And since I mention the notes: if you look at them, you will notice that everywhere they write the learning rate as a function of time, with an index t. Nothing prevents you from letting the learning rate be a function of time, of the iteration; this is called scheduling. You can schedule the learning rate, typically to decrease as time goes on. If the learning rate decreases, it means you are reducing the velocity of your particle. So if the learning rate is high at the beginning of the dynamics, you are exploring larger regions of the landscape, and maybe at some point your dynamics gets trapped in a minimum, but at a large scale; then, if you reduce the learning rate, it is like reducing the temperature in physics: you start to explore the landscape at a finer grain, the inner landscape of this sub-minimum. There may be many large minima at large scales, and as you reduce the temperature you zoom into a sub-region, then a further sub-region, until you converge. Maybe at high learning rates you have two minima like this and you explore both; then there is one that you favor, you reduce the temperature — the learning rate — so you are now zooming into this region, you reduce it further, and you get trapped in finer and finer minima. It is like reducing the temperature. But here I will keep it constant; anyway, if you take it small enough, it is fine. All right, so let us see the equivalence between these two dynamics. Why do I want to show it? Because we will see that this second, equivalent writing of the dynamics connects very nicely with physics. How can you see this mapping? The way I see it is by just checking. So, from the set of equations (1) — and we call the other one the set of equations (2) — what does (1) imply? That $\Delta\theta_{t+1}$, which by definition is $\theta_{t+1} - \theta_t$, gives us what? We look at the update rule: this is $-v_t$.
And we look at the definition of $v_t$, the first update rule, and this gives $-\gamma\, v_{t-1} - \eta$ times the gradient of the cost. Now, the second, equivalent representation tells us that $\Delta\theta_{t+1}$ equals $\gamma\, \Delta\theta_t - \eta$ times the gradient of the cost at $\theta_t$, where $\Delta\theta_t$ is, by definition, $\theta_t - \theta_{t-1}$; I did nothing, I just rewrote this second piece using the definition of $\Delta$. So how do we see that the two sets of equations are equivalent? They have the gradient term in common, here and here. To see the equivalence, we need to check whether the first terms agree, which means: is it true that $\theta_t - \theta_{t-1}$ equals $-v_{t-1}$? That is what we want to check, and it is exactly the relation we already have, with the time index reduced by one unit. So it is true, and the two sets of equations are indeed equivalent. You just use the definitions and check; nothing fancy.
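(Written out, the check is the one-line computation:)

\[
\Delta\theta_{t+1} \equiv \theta_{t+1} - \theta_t = -v_t = -\gamma\, v_{t-1} - \eta\, \nabla_\theta C(\theta_t) = \gamma\, \Delta\theta_t - \eta\, \nabla_\theta C(\theta_t),
\]

(using in the last step that $\Delta\theta_t = \theta_t - \theta_{t-1} = -v_{t-1}$, which is the same relation one time step earlier.)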
All right. So we look at this second dynamics, which is now written in terms of the increments of the parameter values instead of the values themselves, and let us see how to connect it to physics. Okay, so, a parenthesis: some physics. Let me write an equation of motion. Should I write it with theta? Let us say we have a particle described by its position, and its position, as you can imagine, will be called theta; and there is an energy function. The equation of motion is $m\,\ddot\theta(t) + \mu\,\dot\theta(t) = -\nabla_\theta E(\theta(t))$. What do the terms represent? The term $\mu\,\dot\theta$ is a drag, a viscous term: it is opposed to the velocity vector, it tries to reduce your velocity, so it is viscosity. The term on the right involves the gradient of the potential energy, so it is a force, a conservative force deriving from the potential E. And the term $m\,\ddot\theta$ is the inertia: the term proportional to the acceleration. So this Newton equation of motion describes a particle of mass m in a viscous fluid with viscosity constant $\mu$, which at the same time experiences a force derived from a potential energy. And I claim that this equation of motion is the continuous version of this discrete gradient descent with momentum. Gradient descent is an algorithm, you iterate discrete steps; but I claim that if you take the discrete steps to be epsilon-close, very small, this gradient descent has the same dynamics as this particle in a p-dimensional space. So let us discretize the equation of motion. What is the discrete version of the time derivative in the viscous term? It is just $(\theta(t+\Delta t) - \theta(t))/\Delta t$; remember, the mathematical definition of a derivative is the limit of this object as $\Delta t$ goes to zero. Here we are not really taking $\Delta t$ to zero; we just take it small enough. The energy term is unchanged. And the second derivative: you just apply the first-order difference twice, and it gives $(\theta(t+\Delta t) - 2\theta(t) + \theta(t-\Delta t))/\Delta t^2$, with the mass m in front, which I had forgotten. This is the discretized version of our equation of motion. So now let us compare it with the equation we had; let me paste it. How do we identify the terms? The identification is not immediate, we need to work a bit more. Now I claim, and this is an exercise, a two-minute exercise: if you take this discretized equation, define as before $\Delta\theta_{t+1}$, plug that in, write everything in terms of these differences, and isolate $\Delta\theta_{t+1}$ on the left, what you obtain is exactly the equation to check. It is really one line, but you should do it. (Sometimes I put the time index on top instead of at the bottom; it changes nothing, you have to get used to this kind of small slip.) So what I wrote here is the same equation, just rewritten as a function of the time differences. And now let us compare with what we have. What is the mapping? If I define the learning rate to be $\eta = \Delta t^2/(m + \mu\,\Delta t)$ and the momentum to be $\gamma = m/(m + \mu\,\Delta t)$, I have exactly the same equations, and the cost here plays the role of my energy function. And that is why we call this parameter of gradient descent with momentum the momentum: you see that it is proportional to the mass, up to the normalization $m + \mu\,\Delta t$, and the mass is what gives momentum — momentum being mass times velocity. You also see that the learning rate is proportional to the square of the time increment; it is tied to the time step, it controls the speed of the dynamics. All right. So now we have our equivalence between the physical model, a massive particle moving in a landscape, and this gradient descent dynamics. Let me just mention that there is one last variant, called — what is the name again — Nesterov accelerated gradient, or NAG, which is the same as gradient descent with momentum except for one difference: as before you have this momentum term, but you evaluate the gradient of the cost not at your actual position but — and here there is a typo in the notes: there is a plus, it should be a minus — at, if you want, the expected position: the position you would end up at after one unit of time if you were not updating your velocity according to the rule below, if you were just following your momentum from where you are now. That is your actual position plus a time increment along your previous velocity, along your previous direction; again, the minus sign is a convention, since we call velocity minus the velocity. So you do not compute the gradient at the position where you are at time t; you look at the position where you would be if you continued your current dynamics for one additional unit of time, and you compute the gradient there.
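(In formulas, with the sign fixed — a minus where the notes have a plus — the NAG update is:)

\[
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta C\big(\theta_t - \gamma\, v_{t-1}\big), \qquad \theta_{t+1} = \theta_t - v_t ,
\]

(the only change from plain momentum being that the gradient is evaluated at the look-ahead point $\theta_t - \gamma\, v_{t-1}$ rather than at $\theta_t$.)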
So this is just a small variant. Both work very well in practice; honestly, I do not see a difference between one and the other, but now you know it exists. All right. So let us see how these things work by playing a bit with notebook number two, on gradient descent. Let me stop sharing. Is there any question from anyone on these things? There is a question: why do we call it momentum? I think I answered that, right? Firas is asking that, and you updated your question right now, but I think I answered precisely your question, Firas, no? Can you explain NAG again? So, N-A-G: Nesterov accelerated gradient. The only difference with respect to gradient descent with momentum is that when you update, instead of computing the gradient at the actual position, at the actual value of your parameters, you compute the gradient at the value the parameters would take if you followed your current dynamics — your current velocity vector — for one additional unit of time. You look at the point where you would be if you continued according to your current velocity, you compute the gradient there, and you update according to the value of the gradient at that position, not at the one where you are right now. Sorry, but how can I see that from the formula, from the equation? Okay, let me share again. He is asking how we can see that from the equations. So let us look again at gradient descent with momentum, which is there. You see that here the gradient is evaluated at your present position, at the actual value of the parameters at time t. You evaluate the gradient — think of a particle in a landscape: you look at the gradient at the position of the particle at time t, which is here. And then you update the position at time t+1 according to this rule, which is the old value minus the velocity vector that has been computed by the rule above, where $\theta_t$ appears. So the gradient has been computed at your actual position. In Nesterov accelerated gradient, the only difference is here. What is this position? Again, thinking of a particle, it is the position where you would end up if you continued your dynamics for one additional unit of time: your velocity vector at time t-1 would carry you from your actual position at time t to this point, right? And you are computing the gradient at this position instead of just at $\theta_t$. Is it clear? Maybe you should answer something, yes or no? Or maybe not completely, but did I answer your question? Yes, it is clear now. Okay, perfect. Maybe like this: imagine this vector, in dimension p. Maybe this is your $\theta_t$, and if you continued your dynamics by a small increment along this velocity — let us say this is $-\gamma\, v_{t-1}$ — you would end up here, and you compute the gradient at this position instead. That is NAG: you compute the gradient here, while in plain gradient descent with momentum you compute it there. So you see, there is a small difference. All right. So this is notebook number two, on gradient descent algorithms. As usual, I will not go over every detail; that is your job, and these notebooks are self-contained.
You just have to understand and run the examples by yourself. So let us see. What we want to do is visualize this dynamics a bit. Of course, we cannot visualize things in high-dimensional spaces; there is no way to project complicated dynamics without losing information, you would have to project into low dimensions. So what we will do is directly set up problems in low dimensions, and by low dimensions I mean two-dimensional spaces, surfaces, which mathematically, concretely, means functions of two variables x and y. The cost function, if you want, is the height, a variable z, and this height depends on the position x and y. So we have a particle moving on a two-dimensional landscape, we measure the cost by the height z(x, y), and we want to visualize how this gradient dynamics really looks, and to play a bit: if we add some noise to the dynamics, what changes; if we change the momentum or the learning rate, how the trajectories evolve — to get a concrete visualization of what is going on. Of course, in very high dimension, when you are not moving on a surface but in a 10,000-dimensional space, things are different; but we can still gain some intuition in two dimensions. So, this first cell just sets up functions for visualization; I will not go through them, and I did not code them, actually. And these are routines that define the surfaces in which we want the particles to move. One is called the monkey saddle, et cetera; these are typical two-dimensional surfaces on which this kind of dynamics is tested. So here is a function cubic in x and quadratic in y. You have another one. Each time we define the surface, meaning a routine that outputs the height z as a function of x and y, and we also have a routine that computes the gradient of this function, the two-dimensional gradient. You see, for example, that this monkey saddle, which is defined in a certain way, has a term cubic in x, and indeed in the gradient with respect to x a term three x squared appears, together with the term in y squared coming from the part quadratic in y, which indeed appears here. These lines just compute plotting ranges, and it is the same for the other surfaces. So here is a visualization of the surfaces. This is, I think, the so-called monkey saddle. And why is it called a saddle? Because here we see a saddle point, which is a point that is neither a minimum nor a maximum: it is a minimum along some directions and a maximum along others. Why "monkey"? I really do not know. One side remark: in very high-dimensional spaces, the vast majority of critical points — critical points include minima, maxima, and saddles — are actually saddle points. The typical critical point in high dimension is a saddle point. This one here in the middle is also a saddle point, because along one direction it is a minimum and along the other a maximum, but it is less rich, you see. And this one is hard to plot in 2D like this, so we plot the contour curves; that is the way to visualize it. So you know what contour curves are.
If I follow one line of a certain color, it is a line along which the height is constant. So we see that the minimum is here; here we have a valley; here we have another valley with a local minimum, et cetera. Actually, this function here — it is hard to see, but it is actually the case — is also convex. All right. And we will test the algorithms that we discussed. So: gradient descent, plain gradient descent, the simplest update rule. Then gradient descent with momentum. And this Nesterov accelerated gradient, where the only difference, you see, is that the gradient is computed at the position you would reach after one more unit of time if you kept following your current velocity. Here there is no stochastic gradient descent or anything: we compute the gradient exactly. We have a two-dimensional function, it is low-dimensional, so each time we truly compute the gradient of the energy, of the cost. No stochasticity here; no need. All right, so here I am just coding the different dynamics. This is gradient descent, GD; these are just initializations. And you see there is a notion of noise here that I did not really cover — maybe I mentioned it quickly. When you add noise, we sometimes call that Langevin dynamics: we add a Gaussian noise to the gradient, which is one notion of stochasticity. But maybe you remember what I said: in SGD the noise induced by the mini-batches is of a different nature; it is not the same as just adding Gaussian noise. Still, we will emulate a bit of what happens with noise by defining a noise here, a Gaussian noise whose variance is given by this noise strength, which is added to the velocity and also multiplied by the learning rate. Here is the step that computes the gradient: you see we have grad. This function takes as input a function, which will be the gradient of the actual function we are trying to minimize — you know, in Python you can feed a function to a function. So this grad will be the gradient of the cost that we try to minimize, of the surface we are trying to descend. And here is the update rule. Here is the same for gradient descent with momentum. The learning rate, in both cases, is called eta. We have a number of steps; there is no real notion of epochs here, because there are no mini-batches, but we still call the number of iterations the number of epochs. And gamma is the momentum. And here we have the Nesterov accelerated gradient, where the only difference is the position at which you compute the gradient: the gradient is evaluated not at the actual value of the parameters, which would be params, but at the actual value minus gamma times the previous velocity. That is the only difference. So let us see. The first case I will do is actually an even simpler shape than the ones we looked at before: just a paraboloid, which can have a different — how do you call this constant — eccentricity along the two directions. So for example, if I take a perfect paraboloid with the same eccentricity, one and one, a learning rate of 0.1, no noise, and plain gradient descent, we see something which is not fancy at all: we follow exactly the gradient down to the minimum.
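(Condensed, what the routines just described amount to is something like the following sketch — my own compression of the three update rules, not the notebook's exact code, with the Gaussian "Langevin" noise added to the velocity as described above; the paraboloid stands in for the notebook's surfaces:)

```python
import numpy as np

def parabola_grad(p, a=1.0):
    """Gradient of z = x**2 + a*y**2, a paraboloid with tunable eccentricity."""
    x, y = p
    return np.array([2 * x, 2 * a * y])

def descend(grad, init, eta=0.1, gamma=0.0, nesterov=False, noise=0.0, n_steps=100, seed=0):
    """gamma=0: plain GD; gamma>0: momentum; nesterov=True: NAG (look-ahead gradient)."""
    rng = np.random.default_rng(seed)
    theta, v = np.array(init, dtype=float), np.zeros(2)
    for _ in range(n_steps):
        point = theta - gamma * v if nesterov else theta   # NAG's only difference
        v = gamma * v + eta * grad(point) + eta * noise * rng.standard_normal(2)
        theta = theta - v
    return theta

start = [2.0, 1.5]
print("GD      :", descend(parabola_grad, start))
print("momentum:", descend(parabola_grad, start, gamma=0.9))
print("NAG     :", descend(parabola_grad, start, gamma=0.9, nesterov=True))
```

(Re-running with a larger eta or gamma reproduces the oscillations and divergences we are about to see.)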
Now, if I put in a bit of asymmetry, we see — each of these segments, it is not clear from the plot because it interpolates between iterates, but it is actually the case — each segment should be almost orthogonal to the contour curves, which means you are following the gradient, and indeed it looks like it. If I take a very small learning rate, we can hardly see it anymore, but now the lines are almost perfectly orthogonal to the contour curves. Let us put in a bit of noise, variance one. You see that in the first steps, where the gradient has a strong amplitude, the noise does not affect the trajectory much; but when you are close to the minimum, the gradient has a small amplitude, so the update computed there is dominated by the noise. The clean gradient without noise is almost zero — you are at the minimum — so the only source of change here is the noise we are adding artificially, and indeed we see that we fluctuate around the minimum. This can be useful: if there were a very fine-grained structure close to the minimum, you would explore what is going on around it. If you put in too much noise, though, it starts to be a complete mess: you are not following the gradient at all anymore. Okay, now let us make things a bit more interesting. What is next? So, I am still on this very simple paraboloid, but I will run the three different dynamics: gradient descent in black, gradient descent with momentum in pink or purple, and NAG in blue, starting from different initial points. The function is convex, so they should all converge to the minimum here anyway; but let us see how things change. I have 4000 epochs; for both NAG and gradient descent with momentum, the momentum is 0.9, a somewhat arbitrary value. Let us first do a sanity check: with no momentum, the three dynamics should be similar — the same, actually. Let us check; there is no noise. Let us increase the number of iterations a bit — or rather, let us increase the learning rate, because you see the learning rate was small, it was very slow, and after 4000 steps we are still far from the minimum. So here I increase the learning step; all these dynamics are plain gradient descent. And if I increase the learning rate, which was small, to say 2 times 10 to the minus 3, it is still fine, it converges. Let us make it higher. Let us see — no problem. Let us be a bit more extreme; I do not know when it starts to be problematic. Still okay, it still converges. Now it diverges completely; now we have a problem. So let me put 0.1: it still converges; but you see, if I increase just a bit more, say to 0.5, it starts to oscillate completely. Here you are overshooting, and you would oscillate forever: you are in a regime where you bounce from one point of the surface to another, back and forth, and never converge. And if I increase a bit more, it explodes, it diverges; I get numerical issues, the iterates just fly off the surface.
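(The thresholds we just saw empirically can be computed exactly in one dimension — a standard calculation, not from the notebook. For a quadratic cost $C(\theta) = \lambda\,\theta^2/2$, the plain gradient descent update is linear:)

\[
\theta_{t+1} = \theta_t - \eta\,\lambda\,\theta_t = (1 - \eta\lambda)\,\theta_t ,
\]

(so the iterates converge monotonically for $0 < \eta\lambda < 1$, converge while oscillating for $1 < \eta\lambda < 2$, bounce forever at $\eta\lambda = 2$, and diverge for $\eta\lambda > 2$; the one-step convergence mentioned at the beginning is $\eta = 1/\lambda$.)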
All right, so now let us play with the momentum and see what happens. Let us put the learning rate back to something reasonable, 0.01, say, and add momentum, so the three dynamics are different now. We start to see interesting things. With gradient descent with momentum, or with NAG, we see something we would expect from a ball rolling in a bowl: you see inertia entering the game. For example, at this point of the dynamics for gradient descent with momentum, the gradient is not pointing in that direction — the gradient points down toward the minimum — yet the trajectory continues along that direction, due to inertia. It is really like a heavy ball in a bowl: the ball will not simply follow the gradient; the gradient is only part of it. There is a potential that pulls you toward the minimum, the gravitational energy, but at the same time you have inertia, so you follow trajectories like this. If you increase the momentum too much, what happens — what do you expect? It keeps going and never converges; it becomes a mess. So you see these nice trajectories like this. And here the role of the viscosity — for a ball in a bowl, the equivalent role is played by the friction with the medium: the ball has some friction, you lose some energy, and that is why at some point you converge. The viscosity is like the friction; that is why you eventually converge. But if the viscosity is not enough, it just keeps oscillating. All right, you should play a bit with this. Now let us see the next curves; this is more interesting. This is now a complicated, non-convex function, the so-called Beale function — I think it is called like that; it does not really matter, it is just a rather complicated function. And you see that I start the dynamics from four different initial conditions: from this corner, from this corner here, from somewhere down here, and from here, and I look at the three dynamics for each of these four initial conditions. So let us see. You notice that the learning rate eta is much smaller now, because the function is actually much more complex. And you see the difference. Why do we need a much smaller learning rate? Because, remember, the right learning rate essentially depends on the curvature of the function. On the simple paraboloid, the curvature is approximately the same everywhere; the function is really homogeneous. Here, by contrast, you have huge differences: in the middle it is very flat, you have very flat directions everywhere when you are near the bottom, while over here the curvature is very strong. And you need the — sorry, the learning rate — to be able to deal with the steepest directions, because if the learning rate is too high, then in a steep direction you will overshoot and start to oscillate. The problem is that with a small learning rate, when you get close to a flat bottom, the dynamics becomes very slow. But that's life. And hopefully with momentum — that is exactly the point of momentum — even if the gradient is small and the learning rate is small, so that the dynamics should be slow in a flat bottom, you keep velocity. If you drop a ball in a bowl, it accelerates where it is steep.
And even if you arrive in a flat region, due to inertia, due to your momentum, you continue, even where it is flat; whereas if you are just following the gradient, you stop abruptly and then become super slow. You understand the difference? And indeed here we see it: for example, NAG accelerates here, and you see that it climbs up the wall due to momentum, which a priori is bad — in this case you would like to reach the minimum directly, but that's life — and then it follows the gradient again and converges. Plain gradient descent here converges directly to the minimum; but here gradient descent is the best only because we got lucky: we initialized the dynamics very close to the minimum. In such a case, of course, gradient descent is good, but this never happens in practice. And you see these wiggly trajectories, which balance between following the gradient and following the inertia. So let us see again what happens — note that the learning rate is very small and we are doing many steps, 50,000 in this case. What if the learning rate is ten times bigger, but still extremely small? It diverges; here I have problems. It means the dynamics just left the plotted region; everything exploded. So it is super sensitive to the learning rate. Let us decrease the learning rate again and put in more momentum, to see what happens. You see, all of this is very sensitive to the parameters you use. Too much momentum means — look, for example, I do not even know where this trajectory started, but it started somewhere steep, say this one here; it picked up some velocity, and the inertia was so large that even when it arrived in flat regions like this, it did not slow down at all, and it continued along that trajectory until it reached a point where the gradient is strong enough — where, in physical terms, the potential energy is strong enough to compete with the inertia — and that brings it back in the direction of the gradient. But you really need to first reach the top of the hill, where the potential energy starts to compete with the inertia, because the inertia is very strong. Then you go down again, pick up too much speed, continue until the top of another hill, and you never converge. So you have a very subtle competition between the momentum, the learning rate, and all that. The end point of the discussion is that selecting these parameters is highly non-trivial in such problems, and even more so in high dimensions. You have to play with these parameters — and how do you select parameters in machine learning? With cross-validation. You select a set of parameters, you train your algorithm on your data, and you see how it works; seeing how it works means obtaining a prediction error: you test your algorithm on test data and evaluate some prediction error, which gives you a number. Then you repeat for another set of parameters — in this case different learning rates, momenta, whatever you want — until you reach something good enough.
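(As a sketch of what such a cross-validation scan could look like — the helpers train and prediction_error are hypothetical placeholders for "fit the model with these hyperparameters" and "evaluate it on held-out data":)

```python
import itertools
import numpy as np

def select_hyperparameters(train, prediction_error, train_data, val_data,
                           etas=(1e-3, 1e-2, 1e-1), gammas=(0.0, 0.5, 0.9)):
    """Scan a grid of (learning rate, momentum) and keep the best validation error."""
    best, best_err = None, np.inf
    for eta, gamma in itertools.product(etas, gammas):
        model = train(train_data, eta=eta, gamma=gamma)  # run (S)GD with these settings
        err = prediction_error(model, val_data)          # out-of-sample error: one number
        if err < best_err:
            best, best_err = (eta, gamma), err
    return best, best_err
```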
So I really advise you to play a bit with this code; I think you gain a lot of intuition from it. I wanted to start logistic regression, but I will not have time, so tomorrow we will do logistic regression, which is chapter six, I think — the last chapter we want to do. So, logistic regression, where is it... let me stop sharing; just a small introduction to what we do tomorrow. Yes — so, the question is how we know which gradient descent method is best for a given problem. There is no magic recipe. People who play with these things all day long, real practitioners, would surely give you tips; I am not that type of person, so I do not really know. The only systematic recipe to answer that is cross-validation. There is no theorem — I mean, in very special settings there are theorems, but in general, in real applications, of course they do not apply. So the only way to know which method is best is to do cross-validation, which means essentially testing different methods, scanning over large enough ranges of hyperparameter values. Hyperparameters means all the parameters you need to fix in the algorithm: learning rate, momentum, whatever, or the number of features you are using, which sets the complexity of the hypothesis class. You have to test different settings and select the one that leads to the best prediction error. That is the only systematic way to answer; there is no magic. Yes — empirically, to me it seems not to be the case; I do not know. Actually, if you read a bit more, they say that in most applications people just use stochastic gradient descent with momentum. And I cannot really tell you why NAG is better in some cases than gradient descent with momentum, or the other way around in other cases; I do not know, and I do not think there are definitive answers in general. But honestly, the two are very comparable: if you play a bit with the code, you will see there are no big differences. Practically, though, in 99% of cases people use stochastic gradient descent with momentum, especially in deep learning and all that; this is what works best. Other questions? Yes — so here there is a whole part in the notes which I think is useful to read if you want. These are methods that try to emulate, if you want, second-order information: slightly more complex methods called RMSProp or Adam, other optimizers with additional terms that mimic what you would do if you were computing the Hessian — but not really. The idea is basically the same: you follow gradients, but there are additional tricks here that let you improve in certain cases. Again, in general, what leads to the best performance is what we discussed; in certain special cases Adam and RMSProp, these more complicated dynamics, give some improvement, but in general the out-of-the-box solver is stochastic gradient descent with momentum. If you want to have a read, though, I think it is useful to know they exist; I will not discuss them. So, this is what we discussed today. Maybe just the final words that are in the notes, which are useful and which I already mentioned along the course, but let me recap. One thing you always have to do is randomize the data: when you create mini-batches in SGD, each time you start a new epoch you have to reshuffle the data to create new random mini-batches. This is how you induce stochasticity in the problem.
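(A minimal sketch of that per-epoch reshuffling, assuming data held in NumPy arrays X and y:)

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    """Yield random mini-batches, drawing a fresh permutation of the data."""
    order = rng.permutation(len(X))            # reshuffle at the start of each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 10)), rng.standard_normal(1000)
for epoch in range(5):                          # new random mini-batches every epoch
    for X_batch, y_batch in minibatches(X, y, batch_size=32, rng=rng):
        pass  # compute the gradient on (X_batch, y_batch) and update the parameters here
```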
"Transform your inputs" means something I mentioned at some point, and which you should always keep in mind: before processing data in machine learning, you have to standardize and normalize the data. Remember: if you are considering data with a number of features, you put all these features into a big design matrix, this matrix X, where each column is a different feature, and these features can represent totally different quantities. Maybe one feature is the concentration of a chemical, so very small numbers, maybe of order 10^-6 as typical entries of that column; and maybe, in the application you are considering, another feature is something on the scale of the age of the universe, so billions. You are comparing one feature of order 10^-6 with another of order 10^9. Of course, if you process this data without changing anything, the algorithm will naturally give more weight to the features with higher amplitude — but this is an artifact. What you need is to standardize your data, which means rescaling each feature so that the variance along each column is one. You rescale the values so that you have no a priori bias for one feature or another: all the numbers in the matrix are of the same order.
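(A minimal sketch of that standardization on a design matrix X, one feature per column:)

```python
import numpy as np

def standardize(X):
    """Rescale each column of the design matrix to zero mean and unit variance."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma  # keep mu, sigma to apply the same map to test data

# Two features on wildly different scales, as in the example above.
rng = np.random.default_rng(0)
X = np.column_stack([1e-6 * rng.random(100), 1e9 * rng.random(100)])
X_std, mu, sigma = standardize(X)
print(X_std.mean(axis=0).round(12), X_std.std(axis=0))  # ~[0, 0] and [1, 1]
```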
"Monitor the out-of-sample error" just says that whatever optimization procedure you run, and whatever hyperparameters you need to fix — learning rate, momentum, number of features — what matters in the end is the prediction error. To make choices in machine learning, you should make them according to what leads to the best prediction error. And here they mention something else: while you are optimizing your cost function, you should actually plot the prediction error online. Maybe you iterate your learning procedure for ten steps; you then compute, on some test data, the prediction error of the model obtained after those ten steps of learning. Then you iterate for ten more steps, compute the prediction error again, and keep doing this as time goes on. Because it turns out that it is not always good to really find the parameters that minimize the training error: remember, when you push the training error down, at some point what you are actually doing is overfitting, and overfitting means a large difference between the training and prediction errors. So, in order to reduce overfitting, it is sometimes good to do what we call early stopping: not learning until you have truly minimized the training error. When I say "find the argmin of the cost function", in reality this is not always what you want to do; what you want is to find a minimum with good prediction performance. So as the dynamics evolves, as you learn your parameters, you also track the prediction error — which has a computational cost, of course, but you should do it. Typically, what you will see as time goes on is that the training error, the in-sample error, always decreases, while the prediction error does something different: it decreases, and maybe at some point it increases again. Past that point, you have started to overfit. So as time goes on, you should track both the in-sample and the out-of-sample error, the training error and the prediction error, and you should stop when you see that the prediction error has reached its minimum. This is called early stopping. And that's it — here they just say what I told you in words: adaptive optimization methods, the methods we did not discuss that try to emulate second-order information, the more advanced solvers, often have lower performance. So I will not discuss that. Tomorrow, as announced, we discuss logistic regression, or classification: how to deal with data that have discrete labels — we want to classify images, categories, whatever. Until now we considered regression, where the labels, the outputs, were continuous variables; now we want to process labels that are discrete: zero or one, dogs, cats, whatever the categories. And this will actually introduce the baby neural network called logistic regression, connected to the perceptron; we will code a perceptron. Any questions? Yes — what happens if we introduce this notion of noise in linear regression? Okay, the question refers to what I mentioned the other time (not yesterday, because we were all in Barcola or somewhere): the noise induced by the stochasticity in stochastic gradient descent is not Gaussian; it has complicated statistics, it is correlated, et cetera. So the question is: what if we induced this kind of noise artificially in regression or in another problem? We do not know how to induce this noise, because even analyzing it is hard: there are proofs, in simple settings, that this noise is non-Gaussian, but it is not as if we can write down its statistics; they are extremely complicated, and essentially we do not yet know how to write them down. So we cannot emulate it — but we do not need to emulate it: you just do stochastic gradient descent, you just do it. Because it is not only a matter of the type of noise, which helps you; it is also a matter of lowering the computational cost. It would be kind of stupid to compute the full gradient and artificially add noise on top of it, when you can get the same benefits by just doing SGD and computing gradients that are much less costly. All right, so see you tomorrow. At the end of the afternoon I will send a link, as usual. Ciao, everyone.