The first variation of gradient descent that we'll look at is minibatching. Minibatching is a super simple idea. Instead of computing the gradient by averaging over all of the perhaps millions of observations in the training set, as in standard gradient descent, or instead of taking a single point and moving in the direction that reduces the error for that one point, as in stochastic gradient descent, we group together a set of k points, maybe 30, maybe 50, maybe 100, into a minibatch. We then average the gradients over those, say, 50 points, take a step that decreases the error over those 50 points, and repeat that many times. This gives an online learning algorithm that works extremely well; basically, all deep learning is minibatching.

The equation is super simple. Instead of subtracting the learning rate times the gradient of the loss with respect to each weight for a single point, we subtract the learning rate times that gradient averaged over the, say, 50 points in the batch B: w ← w − η · (1/|B|) Σ_{i ∈ B} ∇_w L_i(w). What it says, basically, is: update all the weights, as a vector update, by subtracting off the average gradient of the loss function over the points in the minibatch. That's it.

So give it a try and see how it works. What you should look for, I hope, is how the computation, accuracy, and speed tradeoff changes as you change the minibatch size.
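Here is a minimal sketch of that update loop, assuming a simple linear model with squared-error loss; the data, model, and hyperparameter choices are illustrative rather than anything from the lecture, but the inner step is exactly the rule above: average the per-example gradients over a batch of k points, then move the weights against that average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 10,000 observations, 5 features, known true weights.
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

def minibatch_gd(X, y, batch_size=50, lr=0.01, epochs=5):
    n, d = X.shape
    w = np.zeros(d)                       # initialize the weight vector
    for _ in range(epochs):
        order = rng.permutation(n)        # shuffle once per pass over the data
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Average gradient of the squared-error loss over the minibatch:
            # grad = (1/k) * sum_i 2 * (w . x_i - y_i) * x_i
            residual = Xb @ w - yb
            grad = 2.0 * Xb.T @ residual / len(idx)
            w -= lr * grad                # step against the averaged gradient
    return w

w_hat = minibatch_gd(X, y, batch_size=50)
print(np.round(w_hat, 2))  # should land close to true_w
```

To explore the tradeoff mentioned above, you might rerun this with batch_size set to 1, 50, and 5000 and compare how many updates, how much wall-clock time, and how noisy a path each takes to reach roughly the same weights.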