Greetings, fellow learners! Before we get into the world of batch normalization, I have a thought-provoking question for you. Do you like hiking? And if so, do you like one of those irregular hikes with steep inclines and deep troughs? Or do you prefer a chill trek that just gradually goes up? Or do you just not like hiking at all? I personally love hiking with some challenges in between, but please do let me know what you think down in the comments below, and I'd love to hear about your hiking endeavors. Also, maybe come back to this question after watching this video to see how hiking relates to everything we're going to be talking about here. Now, this video is going to be divided into three passes, starting with an overview, then some details, and then code. So let's get to it.

Let's motivate batch normalization in this first pass. This is a feed-forward neural network, and let's say we want to train this network to take in an input image and determine whether it is a dog or not a dog. During the training phase, we feed in hundreds of image-plus-label pairs. After each iteration, the network parameters are updated and the model learns. Now let's try visualizing this. This is a contour plot of just two parameters of the neural network; in reality, the network possibly has hundreds or thousands of these parameters. Think of this contour plot as seeing a mountain from a bird's-eye view: the blue and green regions are the low parts of the mountain, and the red and orange regions are the high parts. So blue and green are low loss, whereas red is high loss. Now say we initialize the network with some parameter values, so effectively on the contour plot we might start here, somewhere pretty high up the mountain. We then perform one iteration of training by showing an example to the network; the parameters are updated and the loss changes, which we reflect on the contour plot. Notice that as training continues, the loss eventually converges to the blue and green parts, which it should. After all, this means that the loss is decreasing with training. But you'll also notice that the path is pretty zigzaggy, and this is mostly because this contour plot is very stretched along the parameter-one direction and more compressed along the parameter-two direction. This isn't great, because it means a small change in parameter two can easily overshoot the minimum point, while small changes in parameter one sometimes barely affect the loss at all. Hence the zigzag pattern. To resolve this, we can normalize each neuron's output in the neural network. If we were to use batch normalization during training, maybe our contour plot looks more like this. This contour plot represents a smoother terrain, so during the training phase we might start, let's say, here, and after every iteration the jump on this loss terrain converges more cleanly toward the lowest point. So training is more stable, and it is also quicker.

Quiz time. Have you been paying attention? Let's quiz you to find out. What are the problems with this contour plot? A: possible unstable training, as small changes in parameter one can lead to large jumps in loss. B: possible unstable training, as small changes in parameter two can lead to large jumps in loss. C: possible unstable training, as small changes in both parameters one and two can lead to large jumps in loss. Or D: there is no problem, and we should have stable training.
Comment your answer down below and let's have a discussion. And if you think I deserve it at this point, please do consider giving this video a like, because I will really appreciate it. That will do it for quiz time and pass one for now, but keep paying attention, because I will be back to quiz you.

Now let's go through a more technical explanation of batch normalization. Why do the contour plots get stretched along some specific parameters compared to others? Well, the full reason is complicated, but one reason it happens is internal covariate shift. So let's talk about it. This is a neural network, and we want to train it to recognize dog images. In the first iteration, let's say we pass in an image and this hidden layer's top neuron activation is five. A few iterations pass. In iteration number 10, we pass in an image and the top neuron's activation is, let's say, eight. A few more iterations pass. At iteration 100, we pass in an image and the top neuron's activation is 18. For each of these three cases, the distribution of the output of the neuron is different, even though we're still measuring the output of the same neuron. So if we train the neural network and collect all the activation values of this specific neuron, we might see that the variance of those values is actually pretty large. And if the variance of these values is large, that means small changes to the parameters of the network can cause large changes to the neuron's output. This in turn can cause large changes to the output of the neural network, which can in turn cause large changes to the loss. So even if the parameters change by just a little bit, this could potentially cause very large changes in the loss. This is why the contour plot appears stretched out: small changes in parameter two can potentially cause large changes in the loss. And as we saw in the previous pass, this can lead to training instability.

To correct this, we can perform batch normalization. Batch normalization operates across a batch of samples. Neural networks in practice can take multiple samples at once and make predictions for all of them in parallel. Let's say the batch size is three. This means we can pass three images through the network and get three activations for that specific neuron at the same time. Now, this visual over here is a screenshot from the original batch normalization paper, and it shows the math involved. We first compute the mean across the batch, then we compute the variance for the batch. We then normalize the values by subtracting the mean and dividing by the standard deviation. Each neuron also has two learnable parameters associated with it, gamma and beta: we multiply the normalized value by gamma and then add beta to it, which effectively scales and shifts the values. For a more detailed math explanation, I highly recommend checking out this other video I made on batch normalization. For now, just know that the overall outcome of applying batch normalization is that the neuron's output distribution, which had high variance, is now squished, so the values vary less. This means that changing parameter values doesn't change the activation itself too much, and this has a cascading effect on the loss.
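To make those steps concrete, here is a minimal sketch of the batch-norm computation in PyTorch. This is not the code from the video; the function name, the tensor shapes, and the epsilon value are just illustrative assumptions based on the steps described above.

```python
import torch

def batch_norm_1d(x, gamma, beta, eps=1e-5):
    """Batch normalization over a batch of activations.

    x:     tensor of shape (batch_size, num_features)
    gamma: learnable scale, shape (num_features,)
    beta:  learnable shift, shape (num_features,)
    """
    mean = x.mean(dim=0)                         # per-feature mean across the batch
    var = x.var(dim=0, unbiased=False)           # per-feature variance across the batch
    x_hat = (x - mean) / torch.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta                  # scale and shift with learnable parameters

# Toy example: a batch of 3 samples of a single neuron, with activations 5, 8, and 18
acts = torch.tensor([[5.0], [8.0], [18.0]])
gamma = torch.ones(1)
beta = torch.zeros(1)
print(batch_norm_1d(acts, gamma, beta))  # roughly [[-0.96], [-0.42], [1.38]]
```

This is essentially what PyTorch's built-in nn.BatchNorm1d layer does during training, except that the built-in layer also keeps running estimates of the mean and variance so it can normalize single samples at inference time.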
Coming back to the effect on the loss: low-variance activations of neurons mean a low-variance output of the neural network, which means low variance in the change in loss. And that's why the contour plot looks smoother, like this.

Quiz time. It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Gamma will learn to approximate blank, and beta will learn to approximate blank, of the neuron activations. A: the first blank is true mean and the second blank is true variance. B: the first blank is true variance and the second blank is true mean. Or C: neither of the above. Comment your answer down below and let's have a discussion. That's going to do it for quiz time number two and also pass two of this explanation, but keep paying attention, because I will be back to quiz you.

Now for pass three, let's actually take a look at batch normalization in code. We'll compare the performance of networks with and without batch normalization. We first start out by importing the torch libraries. In this case we're going to be using the MNIST dataset, which is a dataset of images. We want to normalize the input values, passing in a mean and standard deviation, and we'll then load the dataset, defining the batch size here as 64. We then define a neural network. To define the neural network, we extend the class torch.nn.Module, and because we are extending this class, we need to override the forward function. In this case we also override the constructor. The constructor defines the different components of this neural network: we have an input layer, hidden layers, and an output layer, where the input size is the image dimensions, 28 by 28. We have some hidden layers over here, and then an output layer of just size 10, because we are classifying the images into the digits zero through nine. We then define the batch normalization layers here. This use_batch_norm flag is a Boolean value: if it is true, we use batch normalization layers; if it is not, we skip them. Each batch norm layer takes as an argument the number of units it normalizes. In this case we want the first batch norm layer to follow the fc1 layer of size 512 and the second batch normalization layer to follow fc2 of size 256, hence these parameters are chosen accordingly. We then use each of these components in the forward pass of our neural network. First of all, we take the input image and flatten it. We then pass it through the first layer, followed by an activation function, and if use_batch_norm is true, we apply a batch normalization layer, which we add here. We then pass it through the layer fc2, followed by a ReLU activation, and then apply batch normalization once again, if required. Finally, we pass it through the output layer; this should be a vector of size 10 for every single example in the batch, and we return the output over here.
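Putting that description together, here is a rough sketch of what such a setup might look like in PyTorch. The video doesn't show the code on screen here, so the class name DigitClassifier, the variable names, and the MNIST normalization constants are my own assumptions; the layer sizes (784 to 512 to 256 to 10), the batch size of 64, and the use_batch_norm flag follow the description above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load MNIST, normalizing inputs with a mean and standard deviation
# (these are commonly used MNIST statistics; the video's exact values may differ)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
train_loader = DataLoader(
    datasets.MNIST("data", train=True, download=True, transform=transform),
    batch_size=64, shuffle=True,
)

class DigitClassifier(nn.Module):
    def __init__(self, use_batch_norm=True):
        super().__init__()
        self.use_batch_norm = use_batch_norm
        self.fc1 = nn.Linear(28 * 28, 512)   # input layer: flattened 28x28 image
        self.fc2 = nn.Linear(512, 256)       # hidden layer
        self.fc3 = nn.Linear(256, 10)        # output layer: digits 0 through 9
        if use_batch_norm:
            self.bn1 = nn.BatchNorm1d(512)   # follows fc1
            self.bn2 = nn.BatchNorm1d(256)   # follows fc2

    def forward(self, x):
        x = x.view(x.size(0), -1)            # flatten the batch of images
        x = torch.relu(self.fc1(x))
        if self.use_batch_norm:
            x = self.bn1(x)                  # batch norm after the first activation
        x = torch.relu(self.fc2(x))
        if self.use_batch_norm:
            x = self.bn2(x)                  # batch norm after the second activation
        return self.fc3(x)                   # one score per digit, for each example
```

Note that, following the narration, batch normalization is applied after the ReLU activation here; in practice you will also see it placed before the activation, and both orderings are used.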
Next, we define a function to train this model. For every epoch, we first zero out the gradients, make a prediction for the batch of inputs to get a batch of outputs, and compute the loss using the criterion, which, as you can see further down, is a cross-entropy loss. We then call loss.backward(), which performs backpropagation in order to compute the gradient values, and optimizer.step() uses the optimizer to update the parameters. The optimizer here is an algorithm that updates the neural network parameters; we are going to be using stochastic gradient descent, and we pass in the model parameters along with an initial learning rate. This is also stochastic gradient descent using the concept of momentum. Now, if you want to know exactly how momentum works, I have explained it in my video on optimizers, right over here, so do check it out. It's a pretty fun watch. Once we update the parameters of the neural network, we display the loss itself.

Next, you'll see that we define a neural network with no batch normalization, passing in the loss function along with the optimizer, and then we perform the training. We do the same exact thing, but with batch normalization layers, passing use_batch_norm as true, along with a cross-entropy loss and stochastic gradient descent with momentum, and then we train this model as well. So we're training one model with no batch normalization, with its loss results printed here, and one with batch normalization, with its loss results printed over here. During the training phase itself, you can see that after a few iterations, the loss values without batch normalization are actually higher than the corresponding values with batch normalization, which means that training looks like it's happening faster with batch normalization. Now let's actually perform some evaluation over here, where we load the MNIST dataset once again and get the model's predictions. These predictions give a value for each of the ten digits, which we map to classification labels right over here, and we determine how many of the images were classified correctly versus incorrectly. You can see that the accuracy with batch normalization slightly outperforms that without batch normalization. So overall, the convergence was slightly faster, with better performance, but do note that these results and this comparison can vary as you change the architecture of your neural network as well as the amount of data you have in your training set. (A small code sketch of this training and evaluation loop appears below, after the summary.)

Quiz time. Okay, this is going to be a fun one. Which of the following does batch normalization not address? A: improved accuracy. B: faster convergence. C: decreased model complexity. Or D: decreased importance of the initial weights. Comment your answer down below and let's have a discussion. And once again, if you think I do deserve it, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz three and pass three of this explanation.

But before we go, let's generate a summary. Neural networks can process batches of data in parallel. Batch normalization is a technique neural networks use to speed up training, make training more stable, and reduce internal covariate shift. And that's all we have for today. The code for this video is linked down in the description below, along with the code for all the other videos in this Deep Learning 101 playlist.
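As promised, here is a minimal sketch of what the training and evaluation just described might look like. In the video, the loss criterion and optimizer are created outside the training function and passed in; in this sketch I build them inside the function for brevity, and the learning rate of 0.01, the momentum of 0.9, and the epoch count are assumptions rather than the exact values used in the video.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_model(model, train_loader, epochs=5, lr=0.01, momentum=0.9):
    criterion = nn.CrossEntropyLoss()                                    # cross-entropy loss
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)  # SGD with momentum
    model.train()                                   # batch norm uses per-batch statistics
    for epoch in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()                   # zero out the gradients
            outputs = model(images)                 # batch of predictions
            loss = criterion(outputs, labels)
            loss.backward()                         # backpropagation: compute gradients
            optimizer.step()                        # update the network parameters
        print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")

def evaluate(model, test_loader):
    correct, total = 0, 0
    model.eval()                                    # batch norm switches to running statistics
    with torch.no_grad():
        for images, labels in test_loader:
            predictions = model(images).argmax(dim=1)   # highest-scoring digit per image
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Comparing the two variants (DigitClassifier is the sketch defined earlier):
# plain = DigitClassifier(use_batch_norm=False)
# bn    = DigitClassifier(use_batch_norm=True)
# train_model(plain, train_loader); train_model(bn, train_loader)
# print(evaluate(plain, test_loader), evaluate(bn, test_loader))
```

The call to model.eval() matters here: it switches the batch norm layers from using per-batch statistics to using their running mean and variance estimates, which is what you want at inference time.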
And also, to continue your understanding of batch normalization, I highly recommend you check out this video right over here. It's a nice supplementary video. So thank you all so much for watching. If you do think I deserve it, please do give this video a like, subscribe, and I will see you in the next one. Bye bye.