Greetings, fellow learners! Welcome to another video in a playlist of videos on the fundamentals of deep learning and neural networks. But before we get started with this wonderful world of optimizers, I have a thought-provoking question for you. How do you usually learn new things? Are you a visual learner who prefers visuals to understand concepts? Or are you more of a hands-on learner, where getting your hands dirty is the best way for you to learn? Or is there another method that you prefer as a learning mode? Please comment your thoughts down below and let's have a discussion! Now, this video is going to be divided into three passes: in the first pass we're going to start with an overview of optimizers themselves, then go into their types, and then some other details too. So let's get to it! This here is a simple feed-forward neural network. We want this neural network to take in an image and classify it as either a dog or not a dog. And so the neural network needs to learn. Specifically, these edge weights, or weight parameters, need to be learned. Let's illustrate a specific optimization algorithm called stochastic gradient descent. First we pass an image into the network, and the network makes a prediction of whether the image is a dog or not. The network outputs a probability value between zero and one. We compare this to the ground truth, which is going to be one, since the image is a dog. We pass both the prediction and the ground truth into a loss function to generate a loss. For this classification task, the loss function could be a cross-entropy loss. Now we take this loss and backpropagate it through the network in order to compute the gradient of the loss with respect to the individual parameters of the network. For more details on this backpropagation process, check out this video on backpropagation. We then use the computed gradients in the optimization algorithm.
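The forward pass, loss, backpropagation, and gradient-descent step described above can be sketched in a few lines of numpy. This is a minimal, hypothetical setup, a single-layer "dog / not dog" classifier with made-up variable names, not the actual network from the video:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy setup: one "image" flattened into 4 features.
rng = np.random.default_rng(0)
x = rng.normal(size=4)   # input features
y = 1.0                  # ground truth: it IS a dog
w = rng.normal(size=4)   # weight parameters to be learned
b = 0.0                  # bias parameter
lr = 0.1                 # learning rate

# Forward pass: a prediction between zero and one.
p = sigmoid(w @ x + b)

# Cross-entropy loss comparing the prediction to the ground truth.
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Backpropagation: gradient of the loss w.r.t. each parameter.
# For sigmoid + cross-entropy this simplifies to (p - y) * x.
grad_w = (p - y) * x
grad_b = (p - y)

# One stochastic gradient descent update, repeated many times in training.
w -= lr * grad_w
b -= lr * grad_b
```

Re-running the forward pass after this single update already gives a slightly lower loss; real training just repeats this loop over many examples.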
In the case of stochastic gradient descent, we update each weight parameter using this formula. This here is one iteration of weight updates, and we repeat the process hundreds or thousands of times over the course of training. Then, during the inference phase, we pass an unseen image to the network, and ideally it will make the correct prediction of whether it is a dog or not a dog. Quiz time! Have you been paying attention? Let's quiz you to find out. What is the difference between backpropagation and an optimizer? A. Backpropagation computes gradients while optimizers compute updated model parameters. B. Backpropagation computes updated model parameters and the optimizer computes gradients. C. Backpropagation and optimizers are interchangeable terms describing the same process. Or D. Backpropagation and optimizers are unrelated processes in neural network training. Comment your answer down below and let's have a discussion. And at this point, if you do think I deserve it, please do give this video a like, because it will help me out a lot. That'll do it for quiz one and pass one of the explanation, but I'll be back, so keep paying attention. Optimizers define how neural networks learn. This here is a non-exhaustive list of optimization algorithms. And while it seems that there are so many of them, they're actually all related to each other. So let's talk about each of them briefly. First, stochastic gradient descent. This is exactly what we described in the previous pass: for every single input-label pair, we update the weight parameters step by step. Next, the concept of momentum helps speed up learning. It does this by making larger learning jumps for common examples and smaller learning jumps for outlier examples. Then the Nesterov accelerated gradient, or NAG, method is the equivalent of stochastic gradient descent plus momentum plus a concept called acceleration.
Acceleration ensures that the algorithm better anticipates the direction of the gradient for stable training. Next is Adagrad. This is an adaptive learning rate algorithm, which means that the learning rate can be different for different parameters. In Adagrad's case, the learning rate depends on the gradient: parameters with large gradients get a smaller learning rate, while parameters with smaller gradients get a larger learning rate. But there's a bit of a problem with this. Here the denominator is a sum of squared gradients over time, which can grow arbitrarily large, and so the overall learning rate will approach zero. To combat this, two optimizers were developed independently of each other. One of them was Adadelta and the other was RMSprop. Both make slight modifications to ensure that this denominator does not explode over time, which allows stable convergence. Next is Adam. Adam is another adaptive learning rate algorithm that does what RMSprop does but also adds momentum. It performs well on many problems, and it's one of the most common optimizers we see today. And then we have Nadam, which is a combination of the Nesterov accelerated gradient and Adam, offering fast convergence and good general performance. For more information on each of these optimizer algorithms and how they interplay with each other, digging into the math to understand why things happen the way they do, I recommend checking out this video on optimizers. Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Why is Adam used over stochastic gradient descent in practice? A. Adam is computationally more efficient and has fewer hyperparameters. B. Adam is less prone to overfitting compared to stochastic gradient descent. C. Adam generally converges faster and requires less manual tuning. Or D. Adam has a more straightforward optimization landscape.
Comment your answer down below and let's have a discussion. That'll do it for quiz time for now, but do keep paying attention, because I will be back to quiz you. For this pass I wanted a fun take on the types of optimizers, so I created a short video that we can watch together. Hey, wanna hike Sunday? Sure, where to? Loss Canyon. Oh, the usual. Yeah, it's a fun downward hike to the lowest point. I don't know, I kinda struggle to find the end. Well, maybe you should take baby steps like me. But you take forever. Y'all do know there is a middle ground, right? True, and we can even run down some parts, since their trajectory is obvious. But for another perspective, it would be better to go off trail and adapt to your surroundings. Is that fun? More degrees of freedom. Oh nice. So you're the best one here. Nah, he tends to slow down when the hike gets long. I'm more adaptive and faster. So you're the best one here. Cute. And you are? Adadelta, but faster. Are you the best? Hi. There's no end to this. Nope. You know what keeps me up at night? What? What if we all started a hike together? We'd end up alone, and some of us may never be found. Never reaching the bottom. So, Sunday? Quiz time! Okay, this is gonna be a fun one. What are the primary roles of the loss function and the optimizer in the training of neural networks? A. The loss function computes gradients while the optimizer quantifies the model's performance. B. The optimizer determines the model's architecture while the loss function guides parameter updates during training. C. The loss function quantifies the model's performance while the optimizer adjusts the model parameters during training. Or D. The loss function is responsible for hyperparameter tuning while the optimizer computes model predictions. Comment your answer down below and let's have a discussion. And once again, if you do think I deserve it, please do consider giving this video a like; it'll help me out a lot.
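The adaptive optimizers from pass two, Adagrad, RMSprop, and Adam, can be sketched as single-step update functions in numpy. This is a minimal sketch of the standard update rules; the variable names (`cache`, `m`, `v`) and the hyperparameter defaults are common conventions, not values from the video:

```python
import numpy as np

eps = 1e-8  # small constant to keep the denominator away from zero

def adagrad_step(w, grad, cache, lr=0.01):
    # Accumulate squared gradients forever: the denominator only grows,
    # which is why Adagrad's effective learning rate can decay toward zero.
    cache += grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9):
    # An exponential moving average of squared gradients replaces the
    # ever-growing sum, so the denominator stays bounded.
    cache = decay * cache + (1 - decay) * grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999):
    # RMSprop-style denominator (v) plus momentum on the gradient (m),
    # with bias correction for the early steps (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

For instance, on the toy objective f(w) = w², whose gradient is 2w, repeatedly calling `adam_step` walks w toward the minimum at zero, which is the "bottom of the canyon" from the skit.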
Now that's gonna do it for the final quiz and pass three of this explanation. Before we go, though, let's generate a summary. Backpropagation is used to compute the gradient of the loss with respect to each parameter of the neural network. Optimizers define how we use this gradient to actually update the parameters. There are many types of optimization algorithms, and each algorithm overcomes some shortcoming of a preceding algorithm. Now, that's all we have for today. Thank you all so much for watching, and like I mentioned before, if you did like this video, please do consider giving it a like. If you're interested in more information on optimizers in a relatable way, and you're not afraid of some math, do check out this video where we get into the details. Thank you all so much once again, and I will see you in the next one. Bye-bye.