Why use residual connections in transformers? In transformers, you'll notice some connections that skip layers. These exist to address vanishing gradients. In deep neural nets, we update parameter weights by backpropagation, so each weight's update depends on gradients that are multiplied backward through the chain rule, including the derivative of every activation function along the way. For many common activation functions, that derivative is close to zero over much of the input range (saturating sigmoids and tanh are the classic example), and multiplying many near-zero gradients across deep layers drives the overall gradient toward zero. If a gradient is zero, no parameter update happens, and so the model doesn't learn. Residual connections, or skip connections, add a layer's input directly to its output, so information (and gradients) can bypass the layer rather than diminishing to zero, keeping the flow unimpeded. Hence, training really deep neural networks becomes tractable with residual connections.
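As a minimal sketch of the idea (assuming PyTorch; the module, the layer sizes, and the wrapped sublayer here are illustrative, not the exact transformer implementation):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sublayer (e.g. attention or a feed-forward net) with a skip connection."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path (x) carries information and gradients straight through,
        # even if the sublayer's own gradients are near zero.
        return x + self.sublayer(x)

# Illustrative usage: wrap a small feed-forward sublayer of width 512.
block = ResidualBlock(nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)))
x = torch.randn(8, 16, 512)   # (batch, sequence length, model dimension)
y = block(x)                  # same shape as x

Because the forward pass is x + sublayer(x), the gradient of the output with respect to x always includes an identity term, so the backward signal never has to pass solely through the sublayer's (possibly tiny) gradients.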