Let's take an interesting problem: an NBA predictor that determines whether a person will make it to the NBA based on their height and age. Okay, Allen Iverson. This guy is gonna make it to the NBA. Got you. Okay, Giannis Antetokounmpo. This guy also makes it to the NBA. Okay, let me process this. Okay, LeBron James. This dude makes it to the NBA. Okay, let me process that. Peter. This guy doesn't make it to the NBA. Okay, let me process this, even though it's obvious. Cool. Okay, now for another 2,000 examples. So this neural network is learning, and it will eventually learn correctly by minimizing the loss, but training seems pretty slow.

Let's normalize the input data. This makes sure that age and height fall within the same small, fixed range. Let's see what happens. Allen Iverson makes it. Okay, I get it. Giannis makes it. Okay, got you. LeBron makes it. Okay. Peter Griffin doesn't make it. Okay. I only need another 50 examples now. This network learned much faster, reaching the same loss minimum.

So if normalizing the input works, why not normalize the activations in every layer with batch normalization? AI makes it. Giannis makes it. LeBron makes it. Peter doesn't make it. I'm done, time to test. This is even faster during training, and it's called batch normalization because we normalize the values with respect to the batch of inputs. That's the overarching principle, but I'll explain it in detail with the math later on.

First, let's ease into the technical stuff. Why are we doing batch normalization? There are three main reasons: it increases the speed of training, like we've seen; it decreases the importance of weight initialization; and it regularizes the model a little bit. Let's go through each one.

Batch normalization increases the speed of training. How is this the case? Well, first up, how does normalization in general speed up training? In our NBA model we had two features, height and age. Feeding these values directly into the network makes the loss surface elongated. This is because height has a small range, roughly 0 to 2.5 meters, while age has a much larger range, up to 150 years or so. Small variations in the height direction can greatly change the loss, so in order to learn effectively with gradient descent, we need a small learning rate to ensure we don't overshoot the minimum. Normalizing the data helps: subtracting the mean and dividing by the standard deviation gives the data a zero mean and a standard deviation of one, and the cost function looks more symmetric.
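To make that input normalization concrete, here's a minimal NumPy sketch. The height and age values are made up purely for illustration; they aren't from the video.

```python
import numpy as np

# Made-up training data: one row per player, columns are [height in meters, age in years]
X = np.array([
    [1.83, 22.0],
    [2.11, 19.0],
    [2.06, 18.0],
    [1.78, 45.0],
])

# Z-score normalization: subtract the per-feature mean, divide by the per-feature std
mean = X.mean(axis=0)
std = X.std(axis=0)
X_normalized = (X - mean) / std

print(X_normalized.mean(axis=0))  # approximately [0, 0]
print(X_normalized.std(axis=0))   # approximately [1, 1]
```

After this step, both features live on the same small scale, which is what makes the loss surface more symmetric.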
We can now use much larger learning rates and get to the minimum faster. Hence, normalization speeds up training. That said, we could also use adaptive learning rate optimizers like Adam, where we effectively have one learning rate for age and another for height. However, it's still better practice to normalize your data before feeding it into your network. So, yes, normalization speeds up the training of your network.

One small caveat, though. In this contour plot, you can see how the lines are closer together in some places and further apart in others. The closer lines indicate steep edges, while the lines that are further apart indicate a flat area. Both of these make learning difficult because they're harder to traverse. Batch normalization has the effect of smoothing out this terrain, making the concentric rings more evenly spread, and this makes the loss much easier to traverse. This loss-smoothing effect is the main reason why batch normalization works so fast and so well.

Now for the next point: batch normalization allows suboptimal starts in the loss function. Without batch normalization, we might start somewhere out here, and it might take us a hundred iterations or so to get to the minimum. But if we start somewhere further out, it might take us a thousand iterations. We can't easily pick a good initial value since the range is so large. But look at the new loss function: we can now randomly sample a number between, say, negative one and positive one as an initial value, and no matter where we start, we end up at the minimum in a similar number of iterations. So batch normalization helps make the initial weights less important.

Batch normalization also acts as a regularizer. When you think of regularization in neural nets, you think of dropout: multiplying the activations of neurons by either zero or one to randomly turn neurons off. With batch normalization too, there is an element of randomness: the mean and variance used to normalize every neuron's activation. These values are highly dependent on the batch, and a batch is composed of random samples. In that regard, batch normalization does induce some regularization. Despite this, we still often use batch normalization together with dropout for better results.

So these are the three main reasons why batch normalization helps model performance. Some of these reasons, and the overview of batch normalization, will probably become clearer when we get into the details, so let's do that. These are the equations from the original paper for batch normalization. Let's pick them apart.
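Written out, for a mini-batch of activations $x_1, \dots, x_m$ of a single neuron (using the paper's notation), the transform is:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Here $\mu_B$ and $\sigma_B^2$ are the batch mean and variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are learnable scale and shift parameters.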
We train the network with mini-batch gradient descent, which means the parameters are updated only after we see a batch of m samples. Let's focus on one neuron and assume the batch size is three, so we update the parameters after seeing three samples. Passing in the first sample, the neuron has an activation of 4. For the second sample, the activation is 7, and for the third sample, the activation is 5. The mean activation for this batch is 5.33, pretty easy math, and the variance is 1.555555..., you get it. Now we normalize the activation values by subtracting the mean and dividing by the standard deviation, and we keep a small epsilon in the denominator in case the variance becomes zero. In other words, the activation values of the batch now have a zero mean and a unit standard deviation, which speeds up training.

But we can't just leave things like this: the mean and variance are heavily dependent on the samples in the batch. So we calibrate the normalized values by introducing two learnable parameters for each neuron, gamma and beta, which scale and shift the normalized activation. Training over multiple batches, beta can learn to approximate the true mean of the neuron's activation and gamma its true standard deviation, and so we get activations that not only speed up training but also give us good results.

Hope all of that was clear. But the question of why batch normalization works is still not completely understood, even today. The original paper said that it was because of its effect on reducing something called internal covariate shift. This has been challenged in more recent papers, which argue that it's mostly because of the smoother terrain of the loss function, which I mentioned previously. Still, the research is ongoing. Even then, I hope this video helped you get some clarity on the matter, and I'll include some interesting reads in the description down below. Hopefully the main paper is now more accessible. Click on my other videos for some more explanations, subscribe, and I'll see you soon. Bye.
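To tie the worked example together, here's a minimal NumPy sketch of the batch-norm forward pass for a single neuron. The activations 4, 7, and 5 come from the example above; the function name and the starting values gamma=1, beta=0 are my own illustrative choices.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch of activations for one neuron, then scale and shift."""
    mu = x.mean()                           # batch mean
    var = x.var()                           # batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit standard deviation
    return gamma * x_hat + beta             # learnable scale and shift

# The three activations from the example: batch size m = 3
x = np.array([4.0, 7.0, 5.0])
print(x.mean(), x.var())                    # 5.333..., 1.555...

y = batch_norm_forward(x, gamma=1.0, beta=0.0)
print(y)                                    # normalized activations, mean ~0, std ~1
```

In a real network, gamma and beta would be updated by gradient descent along with the rest of the parameters, which is how the calibration described above happens over many batches.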