Let's talk about optimizers. Optimizers define how neural networks learn: they search for the parameter values that bring a loss function to its lowest point. Keep in mind that an optimizer doesn't know the terrain of the loss in advance, so it essentially has to find the bottom of a canyon while blindfolded.

Let's start with the one, the only: gradient descent, the original optimizer. Gradient descent takes small, iterative steps until we reach the right weights theta. The problem is that the weights are only updated once per pass over the entire data set, so each gradient is typically large. Theta can only make big jumps, and it may hover around its optimal value without ever actually reaching it.

The solution? Update the parameters more frequently, as in stochastic gradient descent. Stochastic gradient descent updates the weights after seeing each data point instead of the entire data set, but there's a problem here too: the updates can be very noisy and jump away from the optimal values, because they're influenced by every single sample. As a compromise, we use mini-batch gradient descent, updating the parameters after every few samples.

Another way to reduce the noise of stochastic gradient descent is to add momentum. The parameters of a model often tend to change in one direction, especially when examples follow a similar pattern. With momentum, the model can learn faster by paying little attention to the few examples that throw it off from time to time. But you might see a problem here: blindly ignoring a sample just because it isn't typical can be a costly mistake, and that shows up in our loss. Adding an acceleration term helps, though. Say your model is training and gaining momentum, and the weights are growing larger. It finds an odd sample; because of the momentum, it thinks very little of it. But discarding that sample leads to a loss decrease that wasn't as drastic as you thought.
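The update rules described so far can be sketched in a few lines of Python. This is a minimal toy example, not the video's exact setup: the quadratic loss, learning rate, and momentum factor are all illustrative choices.

```python
def grad(theta):
    # Gradient of a toy loss L(theta) = (theta - 3)^2, minimized at theta = 3
    return 2.0 * (theta - 3.0)

def gd(theta, lr=0.1, steps=300):
    # Vanilla gradient descent: one plain step along the negative gradient
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

def sgd_momentum(theta, lr=0.1, gamma=0.9, steps=300):
    # SGD with momentum: a velocity term accumulates past gradients,
    # smoothing out the influence of any single noisy sample
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)
        theta -= v
    return theta

print(gd(0.0))            # both end up close to the minimum at 3
print(sgd_momentum(0.0))
```

On real data, `grad` would be computed from the whole data set (gradient descent), one sample (SGD), or a mini-batch; the update rules themselves stay the same.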
This is where we decelerate our weight updates. The weight updates become smaller again, allowing future samples to fine-tune the current model. Not too shabby. But so far this has been the loss for a single predictor. With multiple predictors, the learning rate is still fixed for every parameter; Adagrad instead allows an adaptive learning rate for every parameter.

I'm on the 3D surface plotter on academo.org, a cool site for plotting equations. This is a plot of z = x² − y², where z is the value of the loss, and this loss keeps decreasing as y tends toward negative or positive infinity. If I were to start somewhere up here at the saddle point, my optimizer should head down along the y-axis, like how my cursor is moving. With an adaptive learning rate, I have more degrees of freedom: I can increase the learning rate in the y-direction and decrease it along the x-direction. In fact, this is what we see here. Adaptive-learning-rate optimizers are able to learn more along one direction than another, hence they can traverse this kind of terrain.

In the Adagrad update, the term G with subscripts t, ii is the sum of squares of the gradients with respect to the parameter theta-i up to that point. The problem is that this G term is monotonically increasing over iterations, so the learning rate will decay to a point where the parameter no longer updates, and there's no learning. We can actually see this effect here for the Adagrad point: as the iterations go on, it learns slower and slower, even though the optimal trajectory is quite clear.

Adadelta to the rescue: it reduces the influence of past squared gradients by introducing a gamma weight on all of those gradients, which shrinks their effect by an exponential factor. So the denominator doesn't explode, and this prevents the learning rate from tanking to zero. Cool, so we now have learning-rate updates for every single parameter.
Well, if that's the case, why not go even further and have momentum updates for every parameter too? And this is what Adam does. The only change you need to make from Adadelta to Adam is to add an expected value of past gradients. What does that mean? It means we are slow initially, but pick up speed over time. And this is intuitively similar to momentum, as you build up momentum over time. In this way, Adam can take different-sized steps for different parameters, and with momentum for every parameter, it can also converge faster. Because of its speed and accuracy, I think you can see why Adam is used as the de facto optimizer for many projects.

Of course, we can go even further and introduce acceleration into Adam: Nadam. And I could go on. It might seem like a ton of optimizers are out there, and there are. But we've literally just added a term to each algorithm, gradually making them capable of more things.

With all of these optimizers, though, which is the best one? Well, that depends on the kind of problem you're trying to solve. Instance segmentation, semantic analysis, machine translation, image generation: there are so many problems out there, with different types of losses. The best algorithm is the one that can traverse the loss for that problem well. It's more empirical than mathematical.

I hope this video helped you better understand the role of these optimizers and cleared some things up too. If you liked the video, hit that like button, click subscribe, and watch some of my other videos on the channel. You won't regret it. Take care.
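To make the "Adadelta plus per-parameter momentum" idea concrete, here is a minimal single-parameter Adam sketch following the standard update (first and second moment estimates with bias correction); the toy loss and hyperparameters are illustrative, not from the video.

```python
import math

def grad(theta):
    # Gradient of a toy loss L(theta) = (theta - 2)^2, minimized at theta = 2
    return 2.0 * (theta - 2.0)

def adam(theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # momentum: mean of past gradients
        v = beta2 * v + (1 - beta2) * g * g    # Adadelta-like squared-gradient term
        m_hat = m / (1 - beta1 ** t)           # bias correction: m and v start at
        v_hat = v / (1 - beta2 ** t)           # zero, so early estimates are scaled up
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

In a real model, `m` and `v` are kept per parameter, which is exactly how Adam gets both an adaptive learning rate and momentum for every weight.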