This is the code for layer normalization. The constructor takes a parameter, shape, which specifies the dimensions along which we want to compute layer normalization. We also define gamma and beta as two learnable parameters for every layer. During the actual forward pass, we compute the mean and the standard deviation over each layer's activations. We then take every neuron, subtract the mean, and divide by the standard deviation, where a small epsilon term has been added to the standard deviation to prevent a division-by-zero error. Then, for every neuron, we multiply by the layer's gamma value and add the layer's beta value, both of which change over time during training. Because of layer normalization, each layer's values now center around zero with unit standard deviation, which gives more stable training.
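Since the code itself isn't shown in the transcript, here is a minimal sketch of the module as described, assuming PyTorch; the class name LayerNorm, the parameter names shape and eps, and the per-example normalization over the trailing dimensions are assumptions based on the description, not the original code.

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Minimal layer normalization sketch, following the description above."""

    def __init__(self, shape, eps=1e-5):
        super().__init__()
        # shape: the dimension(s) along which to normalize (assumed here),
        # e.g. the embedding dimension of each token.
        self.eps = eps
        # gamma (scale) and beta (shift) are learnable parameters,
        # one value per feature, updated during training.
        self.gamma = nn.Parameter(torch.ones(shape))
        self.beta = nn.Parameter(torch.zeros(shape))

    def forward(self, x):
        # Normalize over the trailing dimensions given by `shape`,
        # independently for each example in the batch.
        dims = tuple(range(-self.gamma.dim(), 0))
        mean = x.mean(dim=dims, keepdim=True)
        std = x.std(dim=dims, keepdim=True)
        # Subtract the mean, divide by the standard deviation (plus
        # epsilon to avoid division by zero), then scale and shift.
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

# Example usage: a batch of 8 examples with 16 features each.
ln = LayerNorm(16)
x = torch.randn(8, 16)
out = ln(x)  # per-example mean ~0, std ~1 along the feature dimension
```

One detail worth noting: this sketch adds epsilon directly to the standard deviation, matching the description above, whereas PyTorch's built-in nn.LayerNorm adds epsilon to the variance before taking the square root. Both choices guard against division by zero and behave nearly identically in practice.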