Let's compute layer normalization by hand for these two layers. The output of every neuron's activation is shown. First, take the mean of each layer and determine the standard deviation of each layer. Then, for every neuron in a layer, we subtract the corresponding layer mean and divide by the layer's standard deviation. You'll notice that the values within each layer now have a mean of zero and roughly a unit standard deviation. Next, for each layer, we apply two learnable parameters, gamma and beta. Since they are learnable, their values will change over the course of training. We multiply each neuron's normalized activation by the layer's gamma and add the layer's beta. This keeps activations comparable across different samples, and overall, layer normalization leads to more stable training.
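To make those steps concrete, here is a minimal NumPy sketch of the computation just described. The activation values, and the choice of gamma = 1 and beta = 0, are made up for illustration; in a real network, gamma and beta are parameters learned during training.

```python
import numpy as np

def layer_norm(activations, gamma, beta, eps=1e-5):
    # Mean and standard deviation taken across the layer's neurons.
    mean = activations.mean()
    std = activations.std()
    # Subtract the layer mean and divide by the layer's standard deviation
    # (eps guards against division by zero).
    normalized = (activations - mean) / (std + eps)
    # Scale by the layer's gamma and shift by the layer's beta.
    return gamma * normalized + beta

# Two example layers with hypothetical activations.
layer_a = np.array([2.0, 4.0, 6.0, 8.0])
layer_b = np.array([0.5, 1.5, 2.5])

for acts in (layer_a, layer_b):
    out = layer_norm(acts, gamma=1.0, beta=0.0)
    # With gamma = 1 and beta = 0, the output has mean ~0 and std ~1.
    print(out.mean(), out.std())
```

With gamma = 1 and beta = 0, the output is just the normalized activations, which is why the printed mean is approximately zero and the standard deviation approximately one; as gamma and beta move away from those values during training, the network can rescale and shift each layer as needed.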