What is layer normalization? This is a neural network. The output of each neuron is an activation. Across different examples during training, these activation values can grow very large in magnitude. That can lead to large gradient steps and hence unstable training. To mitigate this, we normalize the values by subtracting the mean and dividing by the standard deviation. Specifically, for layer normalization, we subtract the mean of the layer's activations and divide by their standard deviation, so each activation x becomes (x − μ) / σ. The activation values now fall in a small range, typically centered around zero. Layer normalization is used in transformer neural networks for natural language tasks to keep training stable, because the gradient steps during backpropagation no longer become too large. A small code sketch of this follows.
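Here is a minimal NumPy sketch of that normalization step for a single example. This is an illustration, not a framework implementation: the function name layer_norm and the small eps term (added to avoid division by zero) are my own choices.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x: activations of one layer for one example, shape (num_features,)
    # Subtract the mean of the layer and divide by its standard deviation.
    mean = x.mean()
    std = x.std()
    return (x - mean) / (std + eps)

# Example: large-magnitude activations get squashed into a small
# range centered around zero.
activations = np.array([250.0, -130.0, 75.0, 980.0])
print(layer_norm(activations))
```

Note that practical implementations, such as torch.nn.LayerNorm, additionally apply a learned scale and shift after this normalization, so the network can still represent activations outside the normalized range when useful.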