Why use minibatch gradient descent in transformers? Mainly for faster training. In plain stochastic gradient descent, we pass a single input through the network, generate a prediction, compare the prediction to the label, and quantify the difference as a loss. This loss is then back-propagated to update the parameters of the network, so the parameters are updated after every single training example. That is fine for small networks, where each update is quick, but for networks like the transformer, with millions of parameters, per-example updates are slow. So instead of updating the weights after every example, we update the weights after seeing a batch of examples, and a batched gradient is also a less noisy estimate of the true gradient. Hence, minibatch gradient descent is common practice. If you want to build a transformer, check the playlist on the channel.
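The loop described above can be sketched in a few lines of NumPy. This is a minimal, illustrative example on a toy linear model rather than a transformer; the dataset, learning rate, and batch size are all assumptions chosen for the demo. The key point is that the parameters are updated once per batch, using gradients averaged over the batch, not once per example.

```python
import numpy as np

# Toy data: y = 3*x + 1 plus a little noise (illustrative, not from any real task)
rng = np.random.default_rng(0)
X = rng.normal(size=256)
y = 3.0 * X + 1.0 + 0.1 * rng.normal(size=256)

w, b = 0.0, 0.0        # parameters of the linear model
lr = 0.1               # learning rate (assumed hyperparameter)
batch_size = 32        # update once per batch of 32 examples

for epoch in range(100):
    perm = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        err = (w * xb + b) - yb             # prediction minus label
        # Gradients of mean squared error, averaged over the minibatch
        grad_w = 2.0 * np.mean(err * xb)
        grad_b = 2.0 * np.mean(err)
        w -= lr * grad_w                    # one update per batch, not per example
        b -= lr * grad_b

print(w, b)  # w and b should approach roughly 3.0 and 1.0
```

With a batch size of 32, this performs 8 parameter updates per epoch instead of 256, which is the whole point: fewer, better-averaged updates.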